AND THE REASONS FOR ITS SUCCESS

Abstract

What is the simplest, but still effective, graph neural network (GNN) that we can design, say, for node classification? Einstein said that we should "make everything as simple as possible, but not simpler." We rephrase it into the 'careful simplicity' principle: a carefully-designed simple model can outperform sophisticated ones in real-world tasks, where data are scarce, noisy, and spuriously correlated. Based on that principle, we propose SlenderGNN that exhibits four desirable properties: It is (a) accurate, winning or tying on 11 out of 13 real-world datasets; (b) robust, being the only one that handles all settings (heterophily, random structure, useless features, etc.); (c) fast and scalable, with up to 18× faster training in million-scale graphs; and (d) interpretable, thanks to the linearity and sparsity we impose. We explain the success of SlenderGNN via a systematic study on existing models, comprehensive sanity checks, and ablation studies on its design decisions.

1. INTRODUCTION

What is the simplest, and still performant, graph neural network (GNN) that we can design? GNNs (Kipf & Welling, 2017; Hamilton et al., 2017; Gilmer et al., 2017) have succeeded in various graph mining tasks such as node classification, clustering, or link prediction. However, it is difficult for a practitioner to choose a proper model for each task without spending extensive time on searching, tuning, and training models due to a large number of GNN variants. Given all these variants, which one should a practitioner use first? Which are the strong and weak points of each variant? Could we design a variant that matches all of the strong points and avoids all the weak ones?

In response to the questions above, we propose SlenderGNN based on the 'careful simplicity' principle: a simple, but carefully-designed model can be more accurate than complex ones due to better generalizability, robustness, and easier training. The design decisions of SlenderGNN (D1-4 in Section 4.2) are carefully made to follow this principle by observing and addressing the pain points of existing GNNs; for example, we generate various forms of graph-based features and combine them (D1), propose structural features (D2), remove redundancy in the generated features (D3), and make the propagator function contain no hyperparameters (D4).

The resulting model, SlenderGNN, is our main contribution (C1) which exhibits the following desirable properties:
• C1.1 - Accurate on both real-world and synthetic datasets, almost always winning or tying in the first place (see Figure 1b, Table 2, and Table 3).
• C1.2 - Robust, being able to handle numerous real settings such as homophily, heterophily, no network effects, graphs with useless features (see Figure 1a and Table 2).
• C1.3 - Fast and scalable, using few, carefully chosen features, it takes only 32 seconds on million-scale graphs (ogbn-Products) on a stock server (see Figure 1b).
• C1.4 - Interpretable, learning the largest weights on informative features, ignoring noisy ones, based on the linear decision function (see Figure 2).

The natural question that arises from the success of SlenderGNN is

Q: "How is it possible that a simpler model is more accurate than a sophisticated, more expressive one?"

Our intuitive justification for the success of SlenderGNN is as follows:
(a) Occam's razor: Since a statistical model tries to 'explain' the given labels, the simplest explanation performs best in general.
(b) Overfitting: Complex models often suffer from overfitting, exacerbated by the fact that training data are usually scarce and expensive.
(c) Spurious correlations: Even with sufficient labeled data, a model trained on noisy features may latch on to spurious correlations, which hurt its performance; a simple model effectively suppresses such noisy features, as our design decision D3 does.

Figure 1: SlenderGNN outperforms existing GNN models, is fast, and passes all sanity checks. See our main results for details (sanity checks in Section 5 and real-world experiments in Section 6). (a) SlenderGNN succeeds in all sanity checks, while none of the existing models does. The table is generated from the results of our actual experiments in Table 2: ✓ means success (accuracy ≥ 80%). (b) SlenderGNN wins both on accuracy and training time: it is represented as the red star in (left) ogbn-arXiv, (middle) ogbn-Products and (right) Pokec, which are large real-world graphs (1.2M, 61.9M and 30.6M edges, respectively). Several baselines run out of memory ('crossed out').

In addition to the intuitive arguments above, our extensive experiments provide hard evidence in favor of the 'careful simplicity' principle: SlenderGNN outperforms complex GNNs in both synthetic (in Table 2) and real-world (in Table 3) datasets, and even its own variants that use nonlinear feature transformation in 9 of the 13 real-world datasets (in Table 4).

Not only do we propose a carefully designed, high-performing GNN, but we also explain the reasons for its success. This is thanks to our two additional contributions (C2-3):
• C2 - Explanation: We propose GNNLIN, a framework for the systematic linearization of existing GNNs. As shown in Table 1 and Section 3, our GNNLIN highlights the similarities, differences, strengths and weaknesses of successful GNN baselines.
• C3 - Sanity checks: We propose a wide range of scenarios (homophily, heterophily, block-communities, bipartite-graph communities, etc.), which reveal the strong and weak points of each GNN variant: see Figure 1a with more details in Table 2 and Section 5.
Reproducibility: Our code is available at https://bit.ly/3fhWJfK along with our datasets for 'sanity checks' and real-world datasets of homophily and heterophily graphs.

2. PROBLEM DEFINITION AND RELATED WORKS

We introduce the problem definition of semi-supervised node classification, symbols frequently used in this paper, and related works on graph neural networks (GNN).

Problem definition

We define the problem of semi-supervised node classification as follows:
• Given: An undirected graph $G = (A, X)$, where $A \in \mathbb{R}^{n \times n}$ is an adjacency matrix, $X \in \mathbb{R}^{n \times d}$ is a node feature matrix, $n$ is the number of nodes, and $d$ is the number of features.
• Given: Labels $y \in \{1, \cdots, c\}^{m}$ for $m$ nodes, where $m \ll n$, and $c$ is the number of classes.
• Predict: The unknown classes of the $n - m$ test nodes in $G$.

We use the following symbols to represent modified adjacency matrices. $\tilde{A} = A + I$ is the adjacency matrix with self-loops. $D = \mathrm{diag}(\tilde{A} \mathbf{1}_{n \times 1})$ is the diagonal degree matrix of $\tilde{A}$, where $\mathbf{1}_{n \times 1}$ is the matrix of size $n \times 1$ filled with ones. $\tilde{A}_{\mathrm{sym}} = D^{-1/2} \tilde{A} D^{-1/2}$ is the symmetrically normalized $\tilde{A}$. Similarly, $A_{\mathrm{sym}} = D^{-1/2} A D^{-1/2}$ is also the symmetrically normalized $A$ but without self-loops. There are other types of normalization, $A_{\mathrm{row}} = D^{-1} A$ and $A_{\mathrm{col}} = A D^{-1}$ (and accordingly $\tilde{A}_{\mathrm{row}}$ and $\tilde{A}_{\mathrm{col}}$), which we call row and column normalization, respectively, based on the position of the degree matrix. Refer to Appendix A for the table of symbols used frequently in this work.

As a background, we formally define logistic regression (LR) as a function to find the weight matrix $W$ that best maps given features to predicted labels by a linear function.

Definition 1 (LR). Given a feature $X \in \mathbb{R}^{n \times d}$ and a label $y \in \mathbb{R}^{m}$, where $m \le n$ is the number of observations, let $Y \in \mathbb{R}^{m \times c}$ be the one-hot representation of $y$, and $y_{ij}$ be the $(i, j)$-th element in $Y$. Then, logistic regression (LR) finds an optimal weight matrix $W^{*} \in \mathbb{R}^{d \times c}$ as follows:

$$\mathrm{LR}(X, y) = \arg\max_{W} \sum_{i=1}^{m} \sum_{j=1}^{c} y_{ij} \log \hat{y}_{ij}, \quad \text{where } \hat{y}_{ij} = \frac{\exp(w_{\cdot j}^{\top} x_{i})}{\sum_{k=1}^{c} \exp(w_{\cdot k}^{\top} x_{i})},$$

and $w_{\cdot j}$ is the $j$-th column of $W$. We omit the bias term for brevity without loss of generality.
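The normalizations above can be sketched in a few lines of NumPy. This is a minimal illustration on a dense matrix, not the released implementation; the function name `normalized_adjacencies` is hypothetical, and the sketch assumes that each normalization uses the degrees of the matrix being normalized (with self-loops for the tilde variants, without for the others), which is one reading of the notation.

```python
import numpy as np

def normalized_adjacencies(A):
    """Return the modified adjacency matrices named in Section 2.

    A is a dense, symmetric 0/1 adjacency matrix without self-loops.
    Assumption: every normalization uses the degrees of the matrix it normalizes.
    """
    n = A.shape[0]
    A_tilde = A + np.eye(n)                 # adjacency with self-loops
    deg_tilde = A_tilde.sum(axis=1)         # diag(A_tilde @ 1); always >= 1
    deg = A.sum(axis=1).astype(float)       # degrees without self-loops
    # Isolated nodes have degree 0 without self-loops; map 1/0 to 0.
    inv = np.divide(1.0, deg, out=np.zeros(n), where=deg > 0)
    inv_sqrt = np.sqrt(inv)
    inv_sqrt_tilde = 1.0 / np.sqrt(deg_tilde)
    return {
        "A_tilde_sym": inv_sqrt_tilde[:, None] * A_tilde * inv_sqrt_tilde[None, :],
        "A_sym": inv_sqrt[:, None] * A * inv_sqrt[None, :],
        "A_row": inv[:, None] * A,          # each non-empty row sums to 1
        "A_col": A * inv[None, :],          # each non-empty column sums to 1
    }

# Path graph 0 - 1 - 2 as a toy input.
A_path = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], dtype=float)
mats = normalized_adjacencies(A_path)
```

For graphs of realistic size, the same operations would be applied to sparse matrices; the dense form is used here only to keep the example short.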

2.1. RELATED WORKS

Graph neural networks

There are many recent GNN variants; recent surveys (Zhou et al., 2020; Wu et al., 2021) group them into spectral models (Defferrard et al., 2016; Kipf & Welling, 2017), sampling-based models (Hamilton et al., 2017; Ying et al., 2018), attention-based models (Velickovic et al., 2018; Kim & Oh, 2021; Brody et al., 2022), and deep models with residual connections (Li et al., 2019; Chen et al., 2020). Decoupled models (Klicpera et al., 2019a; b; Chien et al., 2021) separate the two major functionalities of GNNs: the node-wise feature transformation and the propagation. GNNs are often fused with graphical inference (Yoo et al., 2019; Huang et al., 2021).

Linear graph neural networks

Wu et al. (2019) proposed SGC by removing the nonlinear activation functions of GCN (Kipf & Welling, 2017), reducing the propagator function to a simple matrix multiplication. Wang et al. (2021) and Zhu & Koniusz (2021) improved SGC by manually adjusting the strength of self-loops with hyperparameters, increasing the number of propagation steps. Li et al. (2022) proposed G²CN, which improves the performance of DGC (Wang et al., 2021) on heterophily graphs by combining multiple propagation settings (i.e., bandwidths). The main limitation of these models is the high complexity of propagator functions with many hyperparameters, which impairs both the robustness and interpretability of decisions even with linearity.

Graph kernel methods

Traditional works on graph kernel methods (Smola & Kondor, 2003; Ioannidis et al., 2017) are closely related to linear GNNs, which can be understood as applying a linear graph kernel to transform the raw features. A notable limitation of such kernel methods is that they are not capable of addressing various scenarios of real-world graphs, such as heterophily graphs, as their motivation is to aggregate all information in the local neighborhood of each node, rather than ignoring noisy and useless ones.
We implement three popular kernel methods as additional baselines and show that our SlenderGNN outperforms them in both synthetic and real graphs.

3. PROPOSED FRAMEWORK: GNNLIN

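Why do GNNs work well when they do? In what cases will a GNN fail? We answer these questions with our proposed GNNLIN, which reveals the essence of each GNN variant. The idea is to derive the feature propagator function that each variant uses, ignoring nonlinearity. Our observations guide the careful design of our SlenderGNN, which we describe in Section 4.

Definition 2 (Linearized GNN). Given a graph $G = (A, X)$, let $f(\cdot; \theta)$ be a node classifier function to predict the labels of all nodes in $G$ as $\hat{y} = f(A, X; \theta)$, where $\theta$ is the set of learnable parameters. Then, $f$ is linearized if $\theta = \{W\}$ and an optimal weight matrix $W^{*} \in \mathbb{R}^{h \times c}$ is given as

$$W^{*} = \mathrm{LR}(P(A, X), y),$$

where $P$ is a feature propagator function that is linear with $X$ and contains no learnable parameters, and $P(A, X) \in \mathbb{R}^{n \times h}$. We ignore the bias term for brevity without loss of generality.

Definition 3 (GNNLIN). Let $f(\cdot; \theta)$ be a (nonlinear) GNN. GNNLIN is to represent $f$ as a linearized GNN by replacing all (nonlinear) activation functions in $f$ with the identity function and deriving a variant $f'$ that is at least as expressive as $f$ but contains no parameters in $P$.

GNNLIN represents the characteristic of a GNN as the linear feature propagation function $P$, which transforms raw features $X$ by utilizing $A$.

Table 1: Linearized forms of existing GNNs derived by GNNLIN, with the hyperparameters in each propagator function P. The '**' symbol marks the incomplete linearization of attention-based models.

| Model | Type | Propagator function P(A, X) | Hyperparameters |
|---|---|---|---|
| SGC | Linear | $\tilde{A}_{\mathrm{sym}}^{K} X$ | $K$ |
| DGC | Linear | $[(1 - T/K) I + (T/K) \tilde{A}_{\mathrm{sym}}]^{K} X$ | $K, T$ |
| S²GC | Linear | $\sum_{k=1}^{K} (\alpha I + (1 - \alpha) \tilde{A}_{\mathrm{sym}}^{k}) X$ | $K, \alpha$ |
| G²CN | Linear | $\Vert_{i=1}^{N} [I - (T_i/K)((b_i - 1) I + A_{\mathrm{sym}})^{2}]^{K} X$ | $K, N, T_i, b_i$ |
| PPNP* | Decoupled | $(I - (1 - \alpha) \tilde{A}_{\mathrm{sym}})^{-1} X$ | $\alpha$ |
| APPNP* | Decoupled | $[\sum_{k=0}^{K-1} \alpha (1 - \alpha)^{k} \tilde{A}_{\mathrm{sym}}^{k} + (1 - \alpha)^{K} \tilde{A}_{\mathrm{sym}}^{K}] X$ | $K, \alpha$ |
| GDC* | Decoupled | $S_{\mathrm{sym}} X$, where $S = \mathrm{sparse}_{\epsilon}(\sum_{k=0}^{\infty} (1 - \alpha)^{k} \tilde{A}_{\mathrm{sym}}^{k})$ | $\alpha, \epsilon$ |
| GPR-GNN* | Decoupled | $\Vert_{k=0}^{K} \tilde{A}_{\mathrm{sym}}^{k} X$ | $K$ |
| ChebNet* | Coupled | $\Vert_{k=0}^{K-1} A_{\mathrm{sym}}^{k} X$ | $K$ |
| GCN* | Coupled | Same as SGC | $K$ |
| SAGE* | Coupled | $\Vert_{k=0}^{K} A_{\mathrm{row}}^{k} X$ | $K$ |
| GCNII* | Coupled | $\Vert_{k=0}^{K-2} \tilde{A}_{\mathrm{sym}}^{k} X \,\Vert\, ((1 - \alpha) \tilde{A}_{\mathrm{sym}}^{K} + \alpha \tilde{A}_{\mathrm{sym}}^{K-1}) X$ | $K, \alpha$ |
| GAT** | Attention | $\prod_{k=1}^{K} [\mathrm{diag}(X w_{k,1}) \tilde{A} + \tilde{A}\, \mathrm{diag}(X w_{k,2})] X$ | $K, w_{k,1}, w_{k,2}$ |
| DA-GNN** | Attention | $\sum_{k=0}^{K} \mathrm{diag}(\tilde{A}_{\mathrm{sym}}^{k} X w) \tilde{A}_{\mathrm{sym}}^{k} X$ | $K, w$ |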
Lemma 1 shows that GNNLIN generalizes existing linear GNNs. Logistic regression is also represented by GNNLIN with the identity P(A, X) = X. Lemma 1. Our GNNLIN framework includes existing linear GNN models as its special cases: SGC, DGC, S 2 GC, and G 2 CN. Proof. The proof is given in Appendix B. ■
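To make the special cases in Lemma 1 concrete, the following is a minimal NumPy sketch of the SGC, DGC, and S²GC propagators as stated in Appendix B. The function names are hypothetical, dense matrices are used for brevity, and the logistic regression step that would follow (fitting W on P(A, X)) is omitted.

```python
import numpy as np

def sym_norm_with_self_loops(A):
    """Symmetrically normalized adjacency matrix with self-loops."""
    A_tilde = A + np.eye(A.shape[0])
    d = A_tilde.sum(axis=1)
    return A_tilde / np.sqrt(np.outer(d, d))

def sgc(A, X, K):
    """SGC propagator: K-th power of the normalized matrix applied to X."""
    return np.linalg.matrix_power(sym_norm_with_self_loops(A), K) @ X

def dgc(A, X, K, T):
    """DGC (Euler) propagator: T/K controls the strength of self-loops."""
    n = A.shape[0]
    step = (1 - T / K) * np.eye(n) + (T / K) * sym_norm_with_self_loops(A)
    return np.linalg.matrix_power(step, K) @ X

def s2gc(A, X, K, alpha):
    """S2GC propagator: sum over k = 1..K of (alpha*I + (1-alpha)*S^k) X."""
    n = A.shape[0]
    S = sym_norm_with_self_loops(A)
    total = np.zeros((n, n))
    power = np.eye(n)
    for _ in range(K):
        power = power @ S                     # S^k for the current k
        total += alpha * np.eye(n) + (1 - alpha) * power
    return total @ X
```

Two relations fall out directly from the formulas and serve as a quick check: DGC with T = K coincides with SGC, and S²GC with K = 1 and α = 0 coincides with one step of SGC.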

3.1. LESSONS FROM GNNLIN

Table 1 shows the linearized form of existing GNNs generated from our GNNLIN framework. Refer to Appendix C for the detailed information. Based on the result, we spot the fundamental similarities and differences among the GNN variants. There are three distinguishing factors.

Distinguishing Factor 1 (Combination of features). How should we combine the node features, the immediate neighbors' features, and the K-step-away neighbors' features?

GNNs propagate information by multiplying the feature X with (a variant of) the adjacency matrix A multiple times. There are two main choices in Table 1: (1) summation of the transformed features up to K steps (most models), and (2) concatenation (GPR-GNN, GraphSAGE, and GCNII). Simple approaches like SGC are categorized as summation due to the self-loops in Ãsym.

Distinguishing Factor 2 (Modification of adjacency matrices). How should we normalize or modify the adjacency matrix?

The three prevailing choices are (1) symmetric vs. row normalization, (2) the strength of self-loops, including having no self-loops, and (3) static vs. dynamic adjustment based on the given features. Most models use the symmetric normalization Ãsym with self-loops, but some variants avoid self-loops and use either the row normalization A_row or the symmetric one A_sym. Recent models such as DGC, G²CN, and GCNII determine the weight of self-loops with hyperparameters, since strong self-loops allow one to increase the value of K for distant propagation. Finally, attention-based models learn the elements in A based on node features, making propagator functions quadratic with X.

Distinguishing Factor 3 (Heterophily). What to do if the direct neighbors differ in their features or labels?

In such cases, the simple aggregation of the features of immediate neighbors may hurt performance, and therefore, several GNNs do suffer under heterophily as shown in Figure 1a and Table 2.
GNNs that can handle heterophily adopt one or more of these ideas: (1) using the square of A as the base structure (in G²CN; "the enemies of my enemy are my friends"); (2) learning different weights for different steps (GPR-GNN, ChebNet, SAGE, and GCNII); and (3) making small or no self-loops in the A matrix (DGC, S²GC, and G²CN). The idea is to avoid or downplay the effect of immediate (and odd-step-away) neighbors. Self-loops hurt under heterophily, as they force the model to include the information of all intermediate neighbors by acting as an implicit summation of transformed features.
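The heterophily argument can be checked on a toy graph. The sketch below is an illustrative example, not an experiment from this paper: it builds a complete bipartite graph whose two sides carry different labels, uses one-hot labels as features, and compares one-step aggregation, two-step aggregation without self-loops, and two-step aggregation with self-loops.

```python
import numpy as np

# Complete bipartite graph: nodes {0, 1} have class 0, nodes {2, 3} have class 1.
# Every edge connects nodes of different classes (pure heterophily).
A = np.array([[0, 0, 1, 1],
              [0, 0, 1, 1],
              [1, 1, 0, 0],
              [1, 1, 0, 0]], dtype=float)
labels = np.array([0, 0, 1, 1])
X = np.eye(2)[labels]                      # one-hot class indicator as features

A_row = A / A.sum(axis=1, keepdims=True)   # row normalization, no self-loops
one_step = A_row @ X                       # sees only the opposite class
two_step = A_row @ A_row @ X               # returns to the node's own class

A_tilde = A + np.eye(4)                    # self-loops added
d = A_tilde.sum(axis=1)
A_tilde_sym = A_tilde / np.sqrt(np.outer(d, d))
mixed = A_tilde_sym @ A_tilde_sym @ X      # blends 0-, 1-, and 2-step information
```

Here `one_step` equals `1 - X` and `two_step` equals `X`, while each row of `mixed` is close to an even split (5/9 vs. 4/9), which illustrates how self-loops act as an implicit summation over all intermediate steps.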

4. PROPOSED METHOD: SlenderGNN

We propose SlenderGNN, a novel GNN model that addresses the limitations of existing GNNs with the strict adherence to the 'careful simplicity' principle. We first present the pain points of existing GNNs derived from Table 1 and then describe how SlenderGNN addresses them.

4.1. PAIN POINTS OF EXISTING GNNS

Pain Point 1 (Lack of robustness). All models in Table 1 fail to handle multiple graph scenarios at the same time, i.e., graphs with homophily, heterophily, no network effects, or useless features.

Models in Table 1 assume a specific scenario, such as homophily or heterophily graphs, rather than being able to perform in multiple scenarios at the same time. For example, all of these models except ChebNet and SAGE include the self-loops in the updated adjacency matrix, emphasizing the local neighborhood even in graphs with heterophily or no network effects. This is the pain point that we also observe empirically from the sanity checks (in Figure 1a and Table 2).

Pain Point 2 (Failure on noisy features). All models in Table 1 depend on the node feature matrix X, and cannot fully exploit the adjacency matrix A if the given features are noisy.

Real-world datasets often contain noisy features, and the graph structure A plays an essential role in the performance of node classification in such cases. That is, a desirable property for a robust model is to adaptively emphasize important features or disregard noisy ones to maximize its generalization performance. However, the models in Table 1 lack such a functionality.

Pain Point 3 (Hyperparameters in propagators). The hyperparameters in a propagator function P impair the interpretability of the weight matrix W, and force the re-computation of the transformed feature for every new choice during hyperparameter search.

The last column of Table 1 summarizes the hyperparameters in each P, which cause the following limitations for linear models. First, the interpretability of the weight matrix W is impaired, because it is learned on the transformed feature P(A, X), whose meaning changes arbitrarily with the choice of hyperparameters. Second, P(A, X) must be recomputed for each choice of hyperparameters, whereas it can be cached and reused when searching hyperparameters outside P.

4.2. DESIGN DECISIONS OF SlenderGNN

Summary: Considering all the pain points and adhering to the 'careful simplicity' principle, we make design decisions (D1-D4) that lead to the following propagator function P of SlenderGNN:

$$P(A, X) = \underbrace{U}_{\text{Structure}} \,\Vert\, \underbrace{g(X)}_{\text{Node features}} \,\Vert\, \underbrace{g(A_{\mathrm{row}}^{2} X)}_{\text{2-step neighbors}} \,\Vert\, \underbrace{g(\tilde{A}_{\mathrm{sym}}^{2} X)}_{\text{Neighbors}} \qquad (3)$$

where g(·) is the principal component analysis (PCA) for the orthogonalization of each component, followed by an L2 normalization, and U ∈ R^{n×r} contains r-dimensional structural features derived by running the low-rank singular value decomposition (SVD) on the adjacency matrix A.
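Equation 3 can be sketched as follows. This is a small dense-matrix illustration rather than the released implementation: the names `g` and `slender_propagator` are hypothetical, PCA is computed through an SVD of the centered matrix, and the L2 normalization is assumed to be row-wise, which the text does not specify.

```python
import numpy as np

def g(M, r, eps=1e-12):
    """PCA to r dimensions, then L2 normalization (assumed row-wise)."""
    centered = M - M.mean(axis=0, keepdims=True)
    U, s, _ = np.linalg.svd(centered, full_matrices=False)
    scores = U[:, :r] * s[:r]                       # orthogonal principal components
    norms = np.linalg.norm(scores, axis=1, keepdims=True)
    return scores / np.maximum(norms, eps)

def slender_propagator(A, X, r):
    """Concatenate the four components of Equation 3 into an n x 4r matrix."""
    n = A.shape[0]
    deg = A.sum(axis=1).astype(float)
    inv = np.divide(1.0, deg, out=np.zeros(n), where=deg > 0)
    A_row = inv[:, None] * A                        # row-normalized, no self-loops
    A_tilde = A + np.eye(n)
    d = A_tilde.sum(axis=1)
    A_tilde_sym = A_tilde / np.sqrt(np.outer(d, d)) # symmetric, with self-loops
    U, _, _ = np.linalg.svd(A, full_matrices=False) # structural features from A
    parts = [
        U[:, :r],                                   # structure
        g(X, r),                                    # node features
        g(A_row @ (A_row @ X), r),                  # 2-step neighbors
        g(A_tilde_sym @ (A_tilde_sym @ X), r),      # smoothed neighborhood
    ]
    return np.concatenate(parts, axis=1)
```

On large graphs a truncated sparse SVD would replace the dense `np.linalg.svd` calls; the dense version is kept here only so the sketch stays self-contained.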

D1: Concatenation of winning normalizations

The main principle by which SlenderGNN acquires robustness and generalizability, in response to Pain Point 1, is to transform the raw features into various forms and then combine them by concatenation. In this way, SlenderGNN is able to emphasize essential features or ignore useless ones by learning separate weights for different components. The four components of Equation 3 are proposed to have their strength in different scenarios: the structural information U, self-feature information X, two-step aggregation A²_row for heterophily graphs, and the smoothed two-hop aggregation Ã²_sym of the local neighborhood, respectively. Specifically, we use the row-normalized matrix A_row with no self-loops due to the limitations of the symmetric normalization Ãsym: First, the self-loops force one to combine all intermediate neighbors up to the K-hop distance, even in heterophily graphs where the direct neighbors should be avoided. Second, the neighboring features are rescaled based on the node degrees during an aggregation, even when we want simple aggregation of K-hop neighbors preserving the original scale.

D2: Structural features

In response to Pain Point 2: when features are missing, noisy, or useless for classification, we have to resort to the adjacency matrix A, ignoring X. At the same time, it is not effective to use raw A, which is a large sparse matrix. We thus adopt low-rank SVD with rank r to extract structural features U. The value of r is automatically selected to keep 90% of the energy of A, i.e., such that the sum of the largest r squared singular values divided by the squared Frobenius norm of A is approximately 0.9. When the graph is large, we set r to be d for size consistency.
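The energy-based choice of r can be illustrated with a short sketch. The function name `rank_for_energy` is hypothetical, and the rule used here (the smallest r whose cumulative squared singular values reach at least 90% of the squared Frobenius norm) is one way to read "approximately 0.9".

```python
import numpy as np

def rank_for_energy(A, energy=0.9):
    """Smallest rank r whose top-r squared singular values keep `energy` of ||A||_F^2."""
    s = np.linalg.svd(A, compute_uv=False)          # singular values, descending
    cumulative = np.cumsum(s ** 2) / np.sum(s ** 2) # squared Frobenius norm = sum of s^2
    r = int(np.searchsorted(cumulative, energy)) + 1
    return min(r, len(s))                           # guard against round-off at energy = 1

def structural_features(A, energy=0.9):
    """Left singular vectors of A for the selected rank."""
    r = rank_for_energy(A, energy)
    U, _, _ = np.linalg.svd(A, full_matrices=False)
    return U[:, :r]
```

For example, a matrix with singular values 3, 1, 1 has cumulative energy 9/11, 10/11, and 1, so the selected rank is 2.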

D3: Orthogonalization and sparsification

We use two reliable methods to further address Pain Point 2 about noisy features: dimensionality reduction by PCA and regularization by group LASSO. First, we run PCA on each component independently to orthogonalize the given features and to improve the consistency of learned weights. Second, we apply group LASSO to learn sparse weights on the component level, preserving the relative magnitude of each element and suppressing noisy features. To keep the components consistent with each other, we force all components to have the same dimensionality by selecting r features from each component when adopting PCA.
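As an illustration of component-level sparsity, the sketch below computes a group LASSO penalty over the rows of W that belong to each component, together with the standard group soft-thresholding step used by proximal solvers. The optimizer used for SlenderGNN is not described in this section, so the proximal step, the function names, and the omission of per-group size weights are all assumptions made for the example.

```python
import numpy as np

def group_lasso_penalty(W, group_size):
    """Sum of Frobenius norms of consecutive row blocks of W (one block per component)."""
    groups = W.reshape(-1, group_size, W.shape[1])
    return float(np.sum(np.linalg.norm(groups, axis=(1, 2))))

def group_soft_threshold(W, group_size, threshold):
    """Shrink each row block toward zero; blocks with norm <= threshold become exactly zero."""
    groups = W.reshape(-1, group_size, W.shape[1])
    norms = np.linalg.norm(groups, axis=(1, 2), keepdims=True)
    scale = np.maximum(0.0, 1.0 - threshold / np.maximum(norms, 1e-12))
    return (groups * scale).reshape(W.shape)

# Two components with r = 2 rows each: one informative block, one nearly empty block.
W_demo = np.array([[3.0, 4.0], [0.0, 0.0], [0.1, 0.0], [0.0, 0.0]])
W_shrunk = group_soft_threshold(W_demo, group_size=2, threshold=1.0)
```

Because every element of a block is multiplied by the same factor, the relative magnitudes inside a surviving component are preserved, while a weak component is removed as a whole.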

D4: No hyperparameters in the propagator P

We address Pain Point 3 by making the propagator function P contain no hyperparameters to tune for each dataset. Most models in Table 1 contain such hyperparameters. For example, DGC is a linear model, but its interpretability is limited since it selects K and T from arbitrary values, i.e., K ∈ {250, 300, 900} and T ∈ {5.27, 3.78, 6.0498}. On the contrary, our effective design of P (i.e., D1 and D2) allows us to keep the small value of K = 2 in all 13 datasets we use in the experiments without sacrificing its performance.

5. PROPOSED SANITY CHECKS

We propose sanity checks to evaluate the robustness of GNNs to various scenarios of node classification and to observe their strengths and weaknesses in different settings.

Graph scenarios

We categorize possible scenarios of node classification based on the characteristics of node features X, a graph structure A, and node labels y. We describe only the main ideas, leaving exact definitions of such scenarios to Appendix D.
• Features X: We consider three cases: random, structural, and semantic. The random case means that each feature is determined independently of all other variables. In the structural case, features give information of the graph structure, and in the semantic case, they directly provide useful information for node labels. Unlike the cases of edges and labels, which are mutually exclusive, features can be both structural and semantic at the same time.
• Edges A: We consider three cases: uniform (no communities), clustered (block-diagonal), and bipartite. The uniform case means that every element a_ij is determined independently of the other edges in A. In the clustered case, nodes having common neighbors are likely to make more edges, while it is the opposite in the bipartite case.
• Labels y: We consider three cases: individual (no network effects, i.e., there is no predictive power of connectivity), homophily, and heterophily. In the homophily case, adjacent nodes are likely to have the same label, while it is the opposite in the heterophily case.

Feasibility

Although there exist a total of 27 combinations for (X, A, y), not all of them are possible to implement; for example, either homophily or heterophily y is not compatible with uniform A. After removing the infeasible combinations of variables, we categorize the remaining choices based on the predictive power of structure (A) and features (X) for labels (y).
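To give a feel for how such scenarios can be produced, the following sketch samples an undirected graph from label-dependent edge probabilities. It is a hypothetical stand-in for the exact generators of Appendix D, which are not reproduced here: the function name `sample_graph` and the parameters `p_same` and `p_diff` are assumptions made for the example.

```python
import numpy as np

def sample_graph(labels, p_same, p_diff, rng):
    """Sample a symmetric 0/1 adjacency matrix without self-loops.

    An edge between two nodes appears with probability p_same when their
    labels match and p_diff otherwise.
    """
    n = len(labels)
    same = labels[:, None] == labels[None, :]
    prob = np.where(same, p_same, p_diff)
    upper = np.triu(rng.random((n, n)) < prob, k=1)   # sample each pair once
    return (upper | upper.T).astype(float)

rng = np.random.default_rng(0)
labels = np.repeat(np.arange(2), 50)
A_homophily = sample_graph(labels, p_same=0.10, p_diff=0.01, rng=rng)    # clustered edges
A_heterophily = sample_graph(labels, p_same=0.01, p_diff=0.10, rng=rng)  # bipartite-like edges
A_uniform = sample_graph(labels, p_same=0.05, p_diff=0.05, rng=rng)      # edges ignore labels
```

With equal probabilities the edges carry no information about the labels, which corresponds to the uniform case; semantic, structural, or random features would be generated separately.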
Results of sanity checks

Table 2 shows the results of sanity checks for our SlenderGNN and all baseline models (details of the baselines in Section 6). In the table, cell shading marks the top three methods (higher is darker) with overlap within 2σ, and a separate shading marks accuracy below 2σ of the third-best method. We run each experiment five times and report the average and standard deviation. We assume four target classes of nodes, and thus the accuracy of random guessing is 25%. The letters represent the cases of X, A, and y, respectively: S (semantic X), R (random X), T (structural X), U (uniform A), C (clustered A), B (bipartite A), I (individual y), O (homophily y), and E (heterophily y).

It is clear that our SlenderGNN is the only approach that passes all sanity checks. The four components in SlenderGNN are carefully designed to maximize its robustness for various graph scenarios. Most GNNs work well in the cases (S, C, O) and (T, C, O), where X is either semantic or structural, A is clustered, and y is homophily, since many real-world datasets that recent works on GNNs use in their experiments follow such assumptions. However, many GNNs fail when a graph is generated with different assumptions, as we summarize as follows:
• No network effects: In the cases of (S, ?, I), where ? is a placeholder, only a few models such as SAGE and GPR-GNN perform well. This is because A is not informative in such cases, and models are required to focus on raw features X, ignoring A.
• Useless features: In the cases of (R, C, O) and (R, B, E), models are required to do the opposite since X is not informative: they should focus on A, ignoring X. Since there is no approach that explicitly uses A, all baselines show low accuracy.
• Heterophily graphs: In the cases of (?, B, E), where ? is a placeholder, the labels follow heterophily. Models like G²CN, SAGE, and GPR-GNN perform well in these cases, since they address the heterophily in their designs (details in Section 3.1).

6. EXPERIMENTS

We perform experiments on 13 real-world datasets to answer research questions (RQ) on accuracy (RQ1), speed and scalability (RQ2), interpretability (RQ3), and ablation studies (RQ4).

Datasets and competitors

We use 7 homophily and 6 heterophily graphs in experiments, which were commonly used in previous works on node classification (Chien et al., 2021; Pei et al., 2020; Lim et al., 2021). We adopt various types of models as competitors: linear GNNs (LR, SGC, DGC, S²GC, and G²CN), coupled nonlinear models (GCN, GraphSAGE, and GCNII), decoupled models (APPNP and GPR-GNN), and attention-based models (GAT). We also include three graph kernel methods (Smola & Kondor, 2003), namely Regularized Laplacian, Diffusion Process, and the K-step Random Walk. They are used with LR, as SlenderGNN is. We perform hyperparameter search based on the settings reported in their original papers. Refer to Appendix E for details.

Experimental setup

We perform semi-supervised node classification by dividing all nodes in a graph by the 2.5%/2.5%/95% ratio into training, validation, and testing data. We perform five runs of each experiment with different random seeds and report the average and standard deviation. All hyperparameter search and early stopping are done based on validation accuracy for each run.

RQ1. Accuracy

In Table 3, SlenderGNN is compared against linear as well as the state-of-the-art nonlinear GNNs on 13 real-world datasets (7 homophily and 6 heterophily graphs). In the table, cell shading marks the top three methods (higher is darker), red cells mark accuracy below 2σ of the third-best method, and a separate marker denotes the out-of-memory error (O.O.M.). Our SlenderGNN outperforms all competitors in 4 homophily and 5 heterophily datasets, and shows competitive accuracy in the rest (11 out of 13 times among top three methods). Moreover, SlenderGNN is the only model without red cells, which demonstrates its robustness and generality. Many competitors run out of memory when the graph reaches the million-edge scale.
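The 2.5%/2.5%/95% split described in the experimental setup can be sketched as below. The function name `split_nodes` is hypothetical, and the sketch assumes a uniformly random split (not stratified by class) with one seed per run, which is a simplification for illustration.

```python
import numpy as np

def split_nodes(n, rng, train_frac=0.025, val_frac=0.025):
    """Randomly partition node indices 0..n-1 into train/validation/test sets."""
    perm = rng.permutation(n)
    n_train = int(round(train_frac * n))
    n_val = int(round(val_frac * n))
    train = perm[:n_train]
    val = perm[n_train:n_train + n_val]
    test = perm[n_train + n_val:]             # the remaining ~95% of nodes
    return train, val, test

# Five runs with different random seeds, as in the experimental setup.
splits = [split_nodes(1000, np.random.default_rng(seed)) for seed in range(5)]
```

Each run would then select hyperparameters and stop training based on accuracy on its own validation indices.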

RQ2. Speed and scalability

We plot the training time versus the accuracy of each model on the ogbn-arXiv, ogbn-Products, and Pokec datasets, which are the largest in our benchmark, in Figure 1b. We report the training time of each model with the hyperparameters that show the highest validation accuracy. SlenderGNN achieves the highest accuracy in the ogbn-arXiv and ogbn-Products datasets, while being 10.4× and 2.5× faster than the second-best model, respectively. SlenderGNN also shows the highest accuracy in the Pokec dataset, which is a large heterophily graph, while being 18.0× faster than the best performing deep model. It is worth noting that SlenderGNN is even faster than LR in ogbn-arXiv, requiring only half the number of iterations that LR needs while optimizing. Its fast convergence is owing to orthogonalization on each component of the features.

RQ3. Interpretability

Figure 2 illustrates the learned weights of our SlenderGNN for the sanity checks, where the ground truths are known. SlenderGNN assigns large weights to the correct factors in graphs with different mutual information between variables. When there are no network effects in Figure 2a, it successfully assigns the largest weights to the self-features g(X), ignoring all other components. When the features are useless in Figure 2b, it puts most of the attention on the structural features U. In the left two of Figure 2c, when the features are useful but largely correlated with the structure, it ignores the self-features g(X) and assigns the largest weights to the graph-propagated features. In the right two of Figure 2c, when every component is informative, SlenderGNN assigns large weights to both self- and propagated features to maximize its accuracy.
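One simple way to read component-level importance off a learned weight matrix is to compare the norms of the row blocks of W that correspond to each component of Equation 3. The sketch below does this; the function name `component_importance` and the use of normalized Frobenius norms are assumptions for illustration, and the aggregation actually used to draw Figure 2 may differ.

```python
import numpy as np

def component_importance(W, names, r):
    """Share of the total weight norm carried by each r-row block of W."""
    norms = {
        name: float(np.linalg.norm(W[i * r:(i + 1) * r]))
        for i, name in enumerate(names)
    }
    total = sum(norms.values())
    if total == 0:
        return norms                          # all-zero weights: nothing to normalize
    return {name: value / total for name, value in norms.items()}

names = ["U", "g(X)", "g(A_row^2 X)", "g(A_sym^2 X)"]
# Toy weights with r = 1 and two classes: only the self-feature block is nonzero.
W_demo = np.array([[0.0, 0.0], [3.0, 4.0], [0.0, 0.0], [0.0, 0.0]])
shares = component_importance(W_demo, names, r=1)
```

In this toy case all of the weight mass falls on g(X), which mirrors the "no network effects" pattern described for Figure 2a.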

RQ4. Ablation studies

We perform three ablation studies to better understand how SlenderGNN works: its (a) linearity, (b) receptive field, and (c) four different components. We show the results in Tables 4, 8, and 9, respectively (see Appendix F). Detailed information on the settings of ablation studies is also given in Appendix F. In short, we observe that (a) the linear version performs best, (b) the two-step aggregation is sufficient, and (c) all four components in SlenderGNN are essential to its performance.

Specifically, the success of SlenderGNN against its nonlinear variants in Table 4 supports the power of the 'careful simplicity' principle. The increased expressiveness harms the accuracy since (a) the original SlenderGNN is already effective at capturing the information necessary for classification, and (b) an increased number of parameters causes the loss of generality through overfitting. Needless to say, the nonlinear variants take longer to train, have more hyperparameters to tune, and lose the interpretability which is the great advantage of linear models (as we show in Figure 2).

7. CONCLUSION

The main contribution (C1) of this work is SlenderGNN, which is designed by the 'careful simplicity' principle, and thus has a long list of desirable properties:
• C1.1 - Accurate: On both synthetic and real graphs, SlenderGNN exceeds the accuracy of state-of-the-art linear GNNs and matches the accuracy of nonlinear models.
• C1.2 - Robust: SlenderGNN succeeds in graphs with homophily, heterophily, no network effects (i.e., random connections), and no meaningful features.
• C1.3 - Fast and scalable: SlenderGNN is scalable to million-scale graphs, where most of the existing models run out of memory, with up to 18× less training time.
• C1.4 - Interpretable: SlenderGNN automatically selects important features and can justify its decisions based on the learned weights, thanks to its linearity.

Additional contributions focus on explaining the success of SlenderGNN:
• C2 - Explanation: The GNNLIN framework illuminates the fundamental similarities and differences of popular GNN variants (see Table 1).
• C3 - Sanity checks: Our sanity checks immediately highlight the strengths and weaknesses of each GNN method before it is sent to production (see Table 2).

Reproducibility: Our source code and 'sanity checks' are available at https://bit.ly/3fhWJfK.

A TABLE OF SYMBOLS

Table 5 summarizes the symbols frequently used in this paper. The formal problem definition and the detailed description of such symbols are presented in Section 2.

Table 5: Symbols frequently used in this paper.

| Symbol | Definition |
|---|---|
| $\tilde{A} = A + I$ | Adjacency matrix with self-loops |
| $D = \mathrm{diag}(\tilde{A} \mathbf{1}_{n \times 1})$ | Degree matrix of $\tilde{A}$ |
| $\tilde{A}_{\mathrm{sym}} = D^{-1/2} \tilde{A} D^{-1/2}$ | Symmetrically normalized $\tilde{A}$ |
| $A_{\mathrm{sym}} = D^{-1/2} A D^{-1/2}$ | Symmetrically normalized $A$ (i.e., no self-loops) |
| $A_{\mathrm{row}} = D^{-1} A$ | Row-normalized $A$ (i.e., no self-loops) |
| $A_{\mathrm{col}} = A D^{-1}$ | Column-normalized $A$ (i.e., no self-loops) |
| $\mathrm{diag}(\cdot)$ | Function that creates a diagonal matrix from a vector |
| $\mathbf{1}_{a \times b}$ | Matrix of size $a \times b$ filled with ones |
| $I_{a}$ | Identity matrix of size $a \times a$, where the subscript $a$ can be omitted |
| $\mathrm{LR}(\cdot)$ | Logistic regression function defined in Definition 1 |
| $W$ | Learnable weight matrix of size $h \times c$ in the LR function |
| $P(\cdot)$ | Feature propagator function defined in Definition 2 |

B PROOF OF LEMMA 1: REPRESENTING LINEAR MODELS WITH GNNLIN

We prove Lemma 1 by representing each linear GNN with GNNLIN. Let K ≥ 0 be a hyperparameter that determines the number of propagation steps in every GNN.

SGC (Wu et al., 2019) fits the definition of linearization with the following propagator function:

$$P(A, X) = \tilde{A}_{\mathrm{sym}}^{K} X.$$

DGC (Wang et al., 2021) has variants, DGC-Euler and DGC-DK, which have different propagator functions. We focus on DGC-Euler, which is used as the main model in their experiments. DGC is similar to SGC, except that it controls the strength of self-loops as follows:

$$P(A, X) = [(1 - T/K) I + (T/K) \tilde{A}_{\mathrm{sym}}]^{K} X,$$

where T > 0 is a hyperparameter. The self-loops become stronger if T is closer to 0.

S²GC (Zhu & Koniusz, 2021) computes the summation of features propagated with different numbers of steps:

$$P(A, X) = \sum_{k=1}^{K} (\alpha I + (1 - \alpha) \tilde{A}_{\mathrm{sym}}^{k}) X.$$

The original formulation divides the added features by K, which can be safely ignored considering that the weight matrix W is multiplied to the transformed feature for classification.

G²CN (Li et al., 2022) does not provide an explicit formulation of the propagator function, and thus we derive it. First, the parameterized version P′ of the propagator function is given as follows:

$$P'(A, X; \{\theta_i\}_{i=1}^{N}) = \sum_{i=1}^{N} \theta_i H_{i,K}, \qquad (7)$$

where θ_i is a learnable parameter. The k-th feature representation H_{i,k} is recursively defined as

$$H_{i,k} = H_{i,k-1} - \frac{T_i}{K} (L - b_i I)^{2} H_{i,k-1} = H_{i,k-1} - \frac{T_i}{K} ((b_i - 1) I + A_{\mathrm{sym}})^{2} H_{i,k-1} = \Big[I - \frac{T_i}{K} ((b_i - 1) I + A_{\mathrm{sym}})^{2}\Big] H_{i,k-1}, \qquad H_{i,0} = X,$$

where N, T_i, and b_i are hyperparameters, and L = I - A_sym is the normalized Laplacian matrix. Since the transformed features are combined with a learnable parameter θ_i in Equation 7, we make a propagator function P that contains no learnable parameters as follows:

$$P(A, X) = \big\Vert_{i=1}^{N} H_{i,K} = \big\Vert_{i=1}^{N} \Big[I - \frac{T_i}{K} ((b_i - 1) I + A_{\mathrm{sym}})^{2}\Big]^{K} X.$$

C LINEARIZATION PROCESSES

We present detailed processes to linearize various graph neural networks (GNN) as in Table 1.
The linearization is done in two steps. First, we replace all activation functions with the identity function in the original definition of each GNN. Second, if the resulting layer function contains learnable parameters, we devise a replacement that is at least as expressive as the given function but contains no learnable parameters. This is because our goal of linearization is not just to derive a linear function with respect to X, but to understand GNNs in relation to logistic regression based on our GNNLIN framework. We ignore the bias terms of linear layers for simplicity, without loss of generality.

Decoupled models

PPNP, APPNP (Klicpera et al., 2019a), GDC (Klicpera et al., 2019b), and GPR-GNN (Chien et al., 2021) are decoupled GNNs that separate the feature transformation and propagation stages. PPNP runs Personalized PageRank on the node features, and APPNP approximates PPNP with K steps of message propagation. GDC generalizes APPNP by increasing the value of K to ∞ and sparsifying the propagator matrix S. GPR-GNN also generalizes APPNP to avoid the usage of α by learning a weight for each component, resulting in the concatenation of multiple different features. We provide details of linearization in Appendix C.1.
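The second linearization step, replacing a learnable weighted sum with a parameter-free concatenation, can be checked numerically. The following is a minimal sketch, assuming that Ãsym denotes the self-looped, symmetrically normalized adjacency matrix; the helper names `sym_norm_adj` and `concat_propagator` and the random toy graph are illustrative assumptions, not part of the original method description.

```python
import numpy as np

def sym_norm_adj(A):
    """Self-looped symmetric normalization D^{-1/2} (A + I) D^{-1/2} (assumed definition)."""
    A_tilde = A + np.eye(A.shape[0])
    d_inv_sqrt = 1.0 / np.sqrt(A_tilde.sum(axis=1))  # degrees >= 1 due to self-loops
    return A_tilde * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]

def concat_propagator(A, X, K):
    """Parameter-free propagator: concatenation [X | A_sym X | ... | A_sym^K X]."""
    A_sym = sym_norm_adj(A)
    feats, H = [X], X
    for _ in range(K):
        H = A_sym @ H
        feats.append(H)
    return np.concatenate(feats, axis=1)

# Toy data (hypothetical sizes).
rng = np.random.default_rng(0)
n, d, c, K = 6, 4, 3, 2
A = np.triu((rng.random((n, n)) < 0.5).astype(float), 1)
A = A + A.T
X = rng.random((n, d))
W = rng.normal(size=(d, c))
theta = rng.normal(size=K + 1)

# Parameterized form: (sum_k theta_k A_sym^k X) W.
A_sym = sym_norm_adj(A)
summed = sum(theta[k] * np.linalg.matrix_power(A_sym, k) @ X for k in range(K + 1)) @ W

# Concatenated form with a stacked weight matrix reproduces the same output,
# so the concatenation is at least as expressive as the weighted sum.
W_stacked = np.concatenate([theta[k] * W for k in range(K + 1)], axis=0)
concatenated = concat_propagator(A, X, K) @ W_stacked

print(np.allclose(summed, concatenated))
```

The stacked weight matrix is only one particular choice; a classifier trained on the concatenated features is free to pick any other weights, which is why the replacement loses no expressiveness.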

Coupled models

We linearize coupled GNNs including ChebNet (Defferrard et al., 2016), GCN (Kipf & Welling, 2017), GraphSAGE (Hamilton et al., 2017), and GCNII (Chen et al., 2020). The linearized version of GCN is the same as SGC, since the motivation of SGC is to linearize GCN for better scalability and robustness. Although their motivations are different, the linearized versions of ChebNet, GraphSAGE, and GCNII are similar to linearized GPR-GNN in that features propagated by different numbers of steps are combined by concatenation. We provide the details in Appendix C.2.

Attention models

Attention-based GNNs (Velickovic et al., 2018; Kim & Oh, 2021; Brody et al., 2022) are a popular category of GNNs that learn the importance of each edge based on X and learnable parameters. Thus, it is not straightforward to linearize them with the GNNLIN framework: even if we treat the learnable parameters in $\mathcal{P}$ as fixed hyperparameters, $\mathcal{P}$ is at least quadratic in X, since X participates in computing the new adjacency matrix that is multiplied again with X. Nevertheless, we perform incomplete linearization of attention-based models for completeness, and represent the results with the '**' symbol in Table 1. Detailed processes are given in Appendix C.3. The linearized version of GAT uses a different adjacency matrix for each step k, which is the main difference from other linearized models. It is noteworthy that DA-GNN (Liu et al., 2020), proposed as a decoupled model in their paper, is an attention-based model in our analysis. This is because X participates in computing the new adjacency matrix, making $\mathcal{P}$ quadratic as in GAT.
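The claim that attention-style propagators are at least quadratic in X can be illustrated with a scaling test: a linear propagator scales by s when X is scaled by s, whereas a propagator whose edge weights are themselves computed from X scales by s². The sketch below is only an illustration under stated assumptions. In particular, the edge-weight form `e_ij = x_i · w_dst + x_j · w_src` is a hypothetical simplification chosen for this example, and the function names and toy graph are not taken from the paper.

```python
import numpy as np

def linear_propagator(A_norm, X, K=2):
    """SGC-style propagator A_norm^K X, which is linear in X."""
    return np.linalg.matrix_power(A_norm, K) @ X

def attention_like_propagator(A, X, w_dst, w_src):
    """Hypothetical attention-style propagator: edge weights depend on X (assumed form)."""
    scores = (X @ w_dst)[:, None] + (X @ w_src)[None, :]
    E = A * scores  # keep scores only on existing edges
    return E @ X    # X enters twice: once in E, once as the propagated signal

rng = np.random.default_rng(0)
n, d = 5, 3
A = np.triu((rng.random((n, n)) < 0.6).astype(float), 1)
A = A + A.T
X = rng.random((n, d))
w_dst, w_src = rng.normal(size=d), rng.normal(size=d)

s = 3.0
# Linear propagator: output scales by s.
linear_scales_by_s = np.allclose(
    linear_propagator(A, s * X), s * linear_propagator(A, X))
# Attention-like propagator: output scales by s^2.
attention_scales_by_s2 = np.allclose(
    attention_like_propagator(A, s * X, w_dst, w_src),
    s ** 2 * attention_like_propagator(A, X, w_dst, w_src))

print(linear_scales_by_s, attention_scales_by_s2)
```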

C.1 LINEARIZATION OF DECOUPLED GNNS

The propagation in decoupled GNNs is done on the abstract representations of node features, which are typically generated by multilayer perceptrons (MLPs). Linearization starts by replacing MLPs with linear projections, and then removes any additional nonlinearity in the propagation process.

C.1.1 PPNP (KLICPERA ET AL., 2019A)

The linearization of PPNP is straightforward, since the authors present a closed-form representation of the propagator function. If we remove the activation function, we have the following:

$$\mathcal{P}(A, X) = \left(I_n - (1 - \alpha) \tilde{A}_{\mathrm{sym}}\right)^{-1} X,$$

where $0 < \alpha < 1$ is a hyperparameter that controls the weight of self-loops.

C.1.2 APPNP (KLICPERA ET AL., 2019A)

We assume that the initial node representation is created by a single linear layer of $XW$, where $W$ is a weight matrix. Then, the $k$-th representation matrix $H_k$ is represented as follows:

$$H_k = (1 - \alpha) \tilde{A}_{\mathrm{sym}} H_{k-1} + \alpha X W,$$

where $0 < \alpha < 1$ is a hyperparameter. The closed-form representation of $H_K$ is given as follows:

$$H_K = \left( \left[(1 - \alpha) \tilde{A}_{\mathrm{sym}}\right]^{K} + \alpha \sum_{k=0}^{K-1} \left[(1 - \alpha) \tilde{A}_{\mathrm{sym}}\right]^{k} \right) X W.$$

We safely remove the weight matrix $W$, which is redundant, and get the final representation:

$$\mathcal{P}(A, X) = \left[ \sum_{k=0}^{K-1} \alpha (1 - \alpha)^{k} \tilde{A}_{\mathrm{sym}}^{k} + (1 - \alpha)^{K} \tilde{A}_{\mathrm{sym}}^{K} \right] X.$$

C.1.3 GDC (KLICPERA ET AL., 2019B)

GDC generalizes APPNP and presents various forms of the propagation function. We pick the most representative one given in the paper, which is directly related to APPNP. The unnormalized version of the propagation matrix $S'$ is given as follows:

$$S' = \sum_{k=0}^{\infty} \alpha (1 - \alpha)^{k} \tilde{A}_{\mathrm{sym}}^{k},$$

and then it is normalized and sparsified as $S = \mathrm{sparsify}(\tilde{S}'_{\mathrm{sym}})$, where $\tilde{S}'_{\mathrm{sym}}$ represents adding self-loops and applying the symmetric normalization to $S'$. The paper gives two approaches for sparsification: (a) removing elements smaller than $\epsilon$, which is given as a hyperparameter, and (b) selecting the top $k$ neighbors for each node. With either choice, the function $\mathcal{P}$ is simply given as follows:

$$\mathcal{P}(A, X) = S X.$$
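The APPNP closed form can be verified against the recurrence with a few lines of NumPy. This is a minimal sketch under the assumptions that $W$ is dropped (equivalently $W = I$), that $H_0 = X$, and that Ãsym is the self-looped symmetric normalization; the helper names and the toy ring graph are hypothetical. The last check also illustrates that, for large $K$, the APPNP propagator approaches $\alpha$ times the PPNP propagator as written above.

```python
import numpy as np

def sym_norm_adj(A):
    """Self-looped symmetric normalization D^{-1/2} (A + I) D^{-1/2} (assumed definition)."""
    A_tilde = A + np.eye(A.shape[0])
    d_inv_sqrt = 1.0 / np.sqrt(A_tilde.sum(axis=1))
    return A_tilde * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]

def appnp_iterative(A_sym, X, alpha, K):
    """Unroll H_k = (1 - alpha) A_sym H_{k-1} + alpha X, starting from H_0 = X."""
    H = X
    for _ in range(K):
        H = (1 - alpha) * (A_sym @ H) + alpha * X
    return H

def appnp_closed_form(A_sym, X, alpha, K):
    """Closed form: [sum_{k<K} alpha (1-alpha)^k A^k + (1-alpha)^K A^K] X."""
    P = (1 - alpha) ** K * np.linalg.matrix_power(A_sym, K)
    for k in range(K):
        P = P + alpha * (1 - alpha) ** k * np.linalg.matrix_power(A_sym, k)
    return P @ X

def ppnp(A_sym, X, alpha):
    """PPNP propagator (I - (1 - alpha) A_sym)^{-1} X, computed with a linear solve."""
    n = A_sym.shape[0]
    return np.linalg.solve(np.eye(n) - (1 - alpha) * A_sym, X)

# Toy ring graph with 6 nodes (hypothetical example data).
n, d, alpha = 6, 3, 0.1
A = np.roll(np.eye(n), 1, axis=1)
A = A + A.T
X = np.random.default_rng(0).random((n, d))
A_sym = sym_norm_adj(A)

same_for_small_K = np.allclose(
    appnp_iterative(A_sym, X, alpha, 10), appnp_closed_form(A_sym, X, alpha, 10))
converges_to_ppnp = np.allclose(
    appnp_iterative(A_sym, X, alpha, 200), alpha * ppnp(A_sym, X, alpha), atol=1e-6)

print(same_for_small_K, converges_to_ppnp)
```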
C.1.4 GPR-GNN (CHIEN ET AL., 2021)

We assume that the initial node representation is created by a single linear layer of $XW$, where $W$ is a weight matrix. Then, we have the following propagator function $\mathcal{P}'$ with parameters:

$$\mathcal{P}'(A, X; \{\theta_k\}_{k=0}^{K}) = \sum_{k=0}^{K} \theta_k \tilde{A}_{\mathrm{sym}}^{k} X,$$

where $\theta_k$ is a parameter that is learned together with $W$. We replace the summation with concatenation to remove the learnable parameters from $\mathcal{P}'$ and get the following:

$$\mathcal{P}(A, X) = \big\Vert_{k=0}^{K} \tilde{A}_{\mathrm{sym}}^{k} X.$$

C.2.1 CHEBNET (DEFFERRARD ET AL., 2016)

Let $H_k$ be the $k$-th node representation matrix, and $L = I - A_{\mathrm{sym}}$ be the symmetrically normalized graph Laplacian matrix. Then, the propagator function of ChebNet with parameters $\theta$ is

$$\mathcal{P}'(A, X; \theta) = \sum_{k=0}^{K-1} \theta_k H_k,$$

where the recurrence relation is given as follows with the initial terms:

$$H_0 = X$$
$$H_1 = L X = -A_{\mathrm{sym}} X + X$$
$$\vdots$$
$$H_k = 2 (L - I) H_{k-1} - H_{k-2} = -2 A_{\mathrm{sym}} H_{k-1} - H_{k-2}.$$

Based on the recurrence relation, the closed-form representation of $H_k$ is given as

$$H_k = a_k A_{\mathrm{sym}}^{k} X + a_{k-1} A_{\mathrm{sym}}^{k-1} X + \cdots + a_0 X,$$

where $a_0, \cdots, a_k$ are constants. Since we have $K$ free parameters $\theta_0, \cdots, \theta_{K-1}$ corresponding to the $K$ terms in the representation matrix $H_{K-1}$, we safely rewrite the propagator function as

$$\mathcal{P}'(A, X; \theta) = \sum_{k=0}^{K-1} \theta_k A_{\mathrm{sym}}^{k} X.$$

Each value of $k$ has a free parameter $\theta_k$. Thus, we generalize it as follows:

$$\mathcal{P}(A, X) = \big\Vert_{k=0}^{K-1} A_{\mathrm{sym}}^{k} X.$$

C.2.2 GRAPHSAGE (HAMILTON ET AL., 2017)

We assume the mean aggregator of GraphSAGE. By replacing the activation function with the identity function, each layer $\mathcal{F}$ of GraphSAGE is linearized as follows:

$$\mathcal{F}(X) = X W_1 + A_{\mathrm{row}} X W_2,$$

where $A_{\mathrm{row}} = D^{-1} A$ represents the mean operator in the aggregation, and $W_1$ and $W_2$ are learnable weight matrices in the layer.
If we apply a chain of two layers, where the weight matrices of the second layer are represented as $W_3$ and $W_4$, we get the following:

$$\mathcal{F}(\mathcal{F}(X)) = X W_1 W_3 + A_{\mathrm{row}} X (W_1 W_4 + W_2 W_3) + A_{\mathrm{row}}^{2} X W_2 W_4 \tag{25}$$
$$= X W_a + A_{\mathrm{row}} X W_b + A_{\mathrm{row}}^{2} X W_c,$$

where we redefine the weight matrices without loss of generality as

$$W_a = W_1 W_3 \tag{27}$$
$$W_b = W_1 W_4 + W_2 W_3 \tag{28}$$
$$W_c = W_2 W_4.$$

If we generalize it into $K$ layers, we get the following:

$$\mathcal{F}^{K}(X) = \sum_{k=1}^{K} A_{\mathrm{row}}^{k} X W_k.$$

Note that a different weight matrix $W_k$ is applied to each layer $k$. This is equivalent to concatenating the transformed features of all layers and learning a single large weight matrix in training.

• Semantic: $p(x_i, x_j \mid y_i, y_j, a_{ij}) \neq p(x_i, x_j \mid a_{ij})$

The random case represents that features are determined independently of the graph and labels. In this case, node features do not give useful information for classification, but work as unique indices of nodes like one-hot embeddings if the dimensionality $d$ of features is large. In the structural case, features are correlated with the graph structure, but not directly with labels. Such features give useful information for classification only if the graph structure and labels are related. In the semantic case, features directly provide useful information for predicting labels. Note that features can be semantic and structural at the same time by satisfying both of the conditions.

D.2 IMPLEMENTATION OF GRAPH SCENARIOS

There are various ways to generate synthetic graphs satisfying the definitions of different scenarios. One can use a synthetic graph generator designed to create more plausible graphs (Leskovec et al., 2010; Barabási & Albert, 1999), but we choose the simplest one to focus on the mutual information, rather than the other characteristics of synthetic graphs such as the degree distribution.

Structure

We assume that the number of structural clusters is the same as the number $c$ of labels for the alignment with label information. We divide all nodes into $c$ groups and then decide the edge densities for intra- and inter-connections of groups based on the structural type: uniform, clustered, and bipartite. We use a hyperparameter $\epsilon_a$ to determine the noise level in the case of homophily or heterophily: for example, if $\epsilon_a = 0$, the graph has a full block-diagonal adjacency matrix in the case of homophily. The expected number of edges is the same for all three cases, and we set $\epsilon_a = 0$ in all of our experiments.

A notable characteristic of a bipartite structure $A$ is that $A^2$ has a clustered structure. If we create inter-group connections for every possible pair of different groups, even noise-free $A$ with $\epsilon_a = 0$ makes noisy $A^2$ with inter-group connections. For better consistency, we set the number of classes to an even number in our experiments, randomly pick paired classes such as (1, 3) and (2, 4) when $c = 4$, for example, and create inter-group connections only for the chosen pairs. In this way, we create a non-diagonal block-permutation matrix $A$ when $\epsilon_a = 0$, and the noise level of $A^2$, which has a clustered structure, is solely controlled by the noise level of $A$.
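The structural scenarios above can be sketched with a simple block model. The code below is an illustration under several assumptions that the text does not specify: equal-sized groups, a fixed pairing of group g with group g + c/2 (instead of a random pairing), a row-stochastic group-to-group propensity matrix so that the expected number of edges is equal across types, and the hypothetical names `block_propensity`, `sample_graph`, and `avg_density`.

```python
import numpy as np

def block_propensity(c, kind, eps):
    """c x c group-to-group propensities; every row sums to 1 for all three types."""
    uniform = np.full((c, c), 1.0 / c)
    if kind == "uniform":
        return uniform
    if kind == "clustered":
        target = np.eye(c)
    elif kind == "bipartite":
        if c % 2 != 0:
            raise ValueError("bipartite pairing needs an even number of groups")
        # Fixed pairing g <-> g + c/2 (assumption; the text picks pairs at random).
        target = np.roll(np.eye(c), c // 2, axis=1)
    else:
        raise ValueError(f"unknown structure type: {kind}")
    # eps plays the role of the noise level: eps = 0 gives a pure block structure.
    return (1 - eps) * target + eps * uniform

def sample_graph(n, c, kind, eps=0.0, avg_density=0.05, seed=0):
    """Sample an undirected graph without self-loops from the block model."""
    rng = np.random.default_rng(seed)
    groups = np.arange(n) % c
    B = block_propensity(c, kind, eps)
    P = avg_density * c * B[groups][:, groups]  # edge probability per node pair
    upper = np.triu(rng.random((n, n)) < P, k=1)
    A = (upper | upper.T).astype(float)
    return A, groups

A_clu, groups = sample_graph(200, 4, "clustered")
A_bip, _ = sample_graph(200, 4, "bipartite")
same_group = groups[:, None] == groups[None, :]

# Noise-free clustered A has no inter-group edges; noise-free bipartite A has
# no intra-group edges, while its square connects only nodes of the same group.
print(A_clu[~same_group].any(), A_bip[same_group].any(), (A_bip @ A_bip)[~same_group].any())
```

With the fixed pairing, a two-step walk on the bipartite graph always returns to the starting group, which is the property that keeps $A^2$ noise-free when $\epsilon_a = 0$.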

Labels

In the case of individual y, we determine the label of each node uniformly at random, with no consideration of the graph structure. In the case of homophily or heterophily y, we assign labels based on the groups of nodes assumed by the graph structure. That is, nodes in the same group have the same label, and thus the group index itself works as the label y. In this way, we force homophily y for clustered A, and heterophily y for bipartite A. The degree of homophily (or heterophily) is also determined by the noise level $\epsilon_a$ of the graph structure.

Features

We assume that every feature element is sampled from a uniform distribution. Thus, in the random case, we sample each element from the uniform distribution $\mathcal{U}(0, 1)$ between 0 and 1. In the structural case, we run low-rank singular value decomposition (SVD) (Halko et al., 2011) to make X carry structural information. Given $U \Sigma V^{\top} \approx A$ from low-rank SVD, we take $U$ and normalize each feature element to have zero mean and unit variance. The rank $r$ in the SVD is determined as a hyperparameter; higher $r$ captures the structure better, but can give noisy information. We also apply the ReLU function to $U$ to make its elements nonnegative. In the semantic case, we randomly pick $c$ representative vectors $\{v_k\}_{k=1}^{c}$ from the uniform distribution, which correspond to the $c$ different classes. Then, for each node $i$ with label $y$, we sample a feature vector $x$ such that $\arg\max_k x^{\top} v_k = y$. In this way, we have random vectors having sufficient semantic information for the classification of labels, with a guarantee that perfect linear decision boundaries can be drawn in the feature space X at training time.
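The three feature types can be sketched as follows. This is a hedged illustration rather than the authors' generator: it swaps the randomized low-rank SVD of Halko et al. for NumPy's exact SVD truncated to rank r, standardizes each column of U before the ReLU, uses rejection sampling for the semantic case, and rescales the representative vectors to equal sums so that every class is reachable. All function names and sizes are hypothetical.

```python
import numpy as np

def random_features(n, d, rng):
    """Random case: every element drawn from U(0, 1)."""
    return rng.random((n, d))

def structural_features(A, r):
    """Structural case: top-r left singular vectors, standardized, then ReLU."""
    U, _, _ = np.linalg.svd(A)          # exact SVD used in place of low-rank SVD
    U = U[:, :r]
    U = (U - U.mean(axis=0)) / (U.std(axis=0) + 1e-12)
    return np.maximum(U, 0.0)

def semantic_features(y, d, c, rng, max_rounds=1000):
    """Semantic case: rejection-sample x until argmax_k x^T v_k equals the label."""
    V = rng.random((c, d))
    V = V / V.sum(axis=1, keepdims=True)  # assumption: equal sums keep classes balanced
    X = np.empty((len(y), d))
    todo = np.arange(len(y))
    for _ in range(max_rounds):
        if todo.size == 0:
            break
        cand = rng.random((todo.size, d))
        ok = (cand @ V.T).argmax(axis=1) == y[todo]
        X[todo[ok]] = cand[ok]
        todo = todo[~ok]
    if todo.size:
        raise RuntimeError("rejection sampling did not finish")
    return X, V

rng = np.random.default_rng(0)
n, d, c, r = 60, 8, 3, 4
A = np.triu((rng.random((n, n)) < 0.1).astype(float), 1)
A = A + A.T
y = rng.integers(0, c, size=n)

X_rand = random_features(n, d, rng)
X_str = structural_features(A, r)
X_sem, V = semantic_features(y, d, c, rng)

# Every semantic feature vector is classified correctly by the linear scores x^T v_k.
print(((X_sem @ V.T).argmax(axis=1) == y).all())
```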

E REPRODUCIBILITY

E.1 DATASETS

We use 7 homophily and 6 heterophily datasets in experiments, which were used widely in previous works on node classification (Chien et al., 2021; Pei et al., 2020). Cora, CiteSeer, and PubMed (Sen et al., 2008; Yang et al., 2016) are homophily citation graphs between research articles. Computers and Photo (Shchur et al., 2018) are homophily Amazon co-purchase graphs between items. ogbn-arXiv and ogbn-Products are large homophily graphs from Open Graph Benchmark (Hu et al., 2020). Since we use only 2.5% of all labels as training data, we omit the classes with fewer than 100 instances. Chameleon and Squirrel (Rozemberczki et al., 2021) are heterophily Wikipedia graphs. Actor (Tang et al., 2009) is a heterophily graph connected by co-occurrence of actors on Wikipedia pages. Penn94 (Traud et al., 2012; Lim et al., 2021) is a heterophily graph of gender relations in a social network. Twitch (Rozemberczki & Sarkar, 2021) and Pokec (Leskovec & Krevl, 2014) are large graphs, which have been relabeled by Lim et al. (2021) to be heterophily. We make the heterophily graphs undirected as done by Chien et al. (2021). The statistics of datasets are reported in Table 6.

Search space of SlenderGNN (row of Table 7): wd_1 = [1e-3, 1e-4, 1e-5], wd_2 = [1e-3, 1e-4, 1e-5, 1e-6]

E.2 COMPETITORS

The propagator functions of graph kernel methods (Smola & Kondor, 2003) are given as follows:

$$\text{(Reg. Kernel)} \quad \mathcal{P}(A, X) = (I_n + \sigma^2 L)^{-1} X \tag{45}$$
$$\text{(Diff. Kernel)} \quad \mathcal{P}(A, X) = \exp(-\sigma^2/2 \, L) X \tag{46}$$
$$\text{(RW Kernel)} \quad \mathcal{P}(A, X) = (a I_n - L)^{p} X,$$

where $L = D^{-1/2} (D - A) D^{-1/2}$ is the normalized Laplacian matrix, and $\sigma = 1$, $a = 1$, and $p = 2$ are hyperparameters. We use the reasonable default values introduced in the paper.

We perform row-normalization on the node features of all datasets as done in most studies on GNNs. We report the hyperparameters used for a grid search in Table 7, which is done for every split of data. The dimensions of hidden layers are all set to 64, and the probabilities of dropout layers are all set to 0.5. For the linear models, we use L-BFGS as the optimizer and train for 100 epochs with patience 5; for the nonlinear ones, we use ADAM and train for 1000 epochs with patience 200. SlenderGNN contains only 2 hyperparameters, where wd_1 is the weight of LASSO, and wd_2 is the weight of group LASSO. Note that SlenderGNN does not need to recompute its features during the hyperparameter search, whereas most of the linear methods do, because their otherwise precomputable features depend on one or more hyperparameters.
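The three kernel propagators above can be computed directly with NumPy. The following is a minimal sketch: the function names are hypothetical, the matrix exponential is evaluated through an eigendecomposition of the symmetric matrix $L$ (one of several possible ways), and the toy path graph is only example data. The final check uses the fact that $D^{1/2}\mathbf{1}$ lies in the null space of $L$, so all three kernels leave it unchanged when $a = 1$.

```python
import numpy as np

def normalized_laplacian(A):
    """L = D^{-1/2} (D - A) D^{-1/2}; isolated nodes get a zero row and column."""
    d = A.sum(axis=1)
    d_inv_sqrt = np.where(d > 0, 1.0 / np.sqrt(np.maximum(d, 1e-12)), 0.0)
    return d_inv_sqrt[:, None] * (np.diag(d) - A) * d_inv_sqrt[None, :]

def reg_kernel(L, X, sigma=1.0):
    """Regularized Laplacian kernel: (I + sigma^2 L)^{-1} X via a linear solve."""
    return np.linalg.solve(np.eye(L.shape[0]) + sigma ** 2 * L, X)

def diff_kernel(L, X, sigma=1.0):
    """Diffusion kernel: exp(-sigma^2 / 2 * L) X via eigendecomposition of symmetric L."""
    lam, Q = np.linalg.eigh(L)
    return Q @ (np.exp(-0.5 * sigma ** 2 * lam)[:, None] * (Q.T @ X))

def rw_kernel(L, X, a=1.0, p=2):
    """p-step random walk kernel: (a I - L)^p X."""
    return np.linalg.matrix_power(a * np.eye(L.shape[0]) - L, p) @ X

# Toy path graph with 5 nodes (hypothetical example data).
A = np.diag(np.ones(4), 1)
A = A + A.T
L = normalized_laplacian(A)
X = np.random.default_rng(0).random((5, 3))

P_reg, P_diff, P_rw = reg_kernel(L, X), diff_kernel(L, X), rw_kernel(L, X)

# sqrt(degree) is an eigenvector of L with eigenvalue 0, so each kernel preserves it.
x0 = np.sqrt(A.sum(axis=1))[:, None]
print(np.allclose(reg_kernel(L, x0), x0),
      np.allclose(diff_kernel(L, x0), x0),
      np.allclose(rw_kernel(L, x0), x0))
```

None of these features depends on the regularization weights, but each depends on its own hyperparameter ($\sigma$, or $a$ and $p$), so changing it requires recomputing the propagated features.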



α is set to 0.1 in the original paper of GCNII (Chen et al., 2020).



Figure 2: SlenderGNN is interpretable: it suppresses useless information and focuses on the informative ones for each scenario: (a) self-features, (b) structural features, and (c) mixed.

5e-4], K = 2
DGC: wd = [0, 5e-4], K = 200, T = [3, 4, 5, 6]
S2GC: wd = [0, 5e-4], K = 16, α = [0.01, 0.03, 0.05, 0.07, 0.09]
G2CN: wd = [0, 5e-4], K = 100, N = 2, T_1 = T_2 = [10, 20, 30, 40], b_1 = 0, b_2 = 2
GCN: wd = [0, 5e-4], lr = [2e-3, 0.01, 0.05], K = 2
SAGE: wd = [0, 5e-4], lr = [2e-3, 0.01, 0.05], K = 2
GCNII: wd = [0, 5e-4], lr = 0.01, K = [8, 16, 32, 64], α = [0.1, 0.2, 0.5], θ = [0.5, 1, 1.5]
APPNP: wd = [0, 5e-4], lr = [2e-3, 0.01, 0.05], K = 10, α = 0.1
GPR-GNN: wd = [0, 5e-4], lr = [2e-3, 0.01, 0.05], K = 10, α = [0.1, 0.2, 0.5, 0.9]
GAT: wd = [0, 5e-4], lr = [2e-3, 0.01, 0.05], K = 2, heads = 8

The GNNLIN framework is general, encompassing popular GNN models. The * and ** superscripts mark fully and partially linearized models, respectively; see Section 3.1 for details.

SlenderGNN passes all sanity checks. The accuracy of all models on sanity checks; there are three groups of scenarios: (left) only features X help; (middle) only connectivity A helps; (right) both help. See the text for details on S, U, I, C, B, etc. Green marks the top three (higher is darker); red marks the ones that are too low (2σ below the third place).

SlenderGNN wins most of the time on 13 real-world datasets (7 homophily and 6 heterophily graphs) against 14 competitors. We color the best and worst results as in Table 2.

Ablation Study -SlenderGNN works best with linearity: SlenderGNN outperforms its own variants that replace the linear classifier or the PCA function g with a nonlinear module.

and give the results of the last two experiments in Table

Table of symbols.

The statistics of datasets used in our experiments. The first seven datasets are homophily graphs, while the last six are heterophily graphs.

Columns: CiteSeer, PubMed, Computers, Photo, ogbn-arXiv, ogbn-Products, Chameleon, Squirrel, Actor, Penn94, Twitch, Pokec

Search spaces of hyperparameters.

C.2.3 GCNII (CHEN ET AL., 2020)

After replacing the activation function with the identity function, the $l$-th layer $\mathcal{F}_l$ of GCNII is given as follows:

$$\mathcal{F}_l(H) = \left((1 - \alpha_l) \tilde{A}_{\mathrm{sym}} H + \alpha_l X\right) \left((1 - \beta_l) I + \beta_l W_l\right), \tag{31}$$

where $\alpha_l$ and $\beta_l$ are hyperparameters, and $W_l$ is a weight matrix. The second term is equivalent to $W_l$ regardless of the value of $\beta_l$, since $W_l$ is a free parameter. We also set $\alpha_l$ to a constant $\alpha$ which is the same for every layer $l$, following the original paper (Chen et al., 2020).$^1$ Then, the equation is simplified as

$$\mathcal{F}_l(H) = \left((1 - \alpha) \tilde{A}_{\mathrm{sym}} H + \alpha X\right) W_l.$$

If we apply a chain of two layers $l$ and $l + 1$, we get the following:

[...]

where [...], which is also a free parameter. If we generalize it into $K$ layers, we get the following:

[...]

We safely remove the constant from each term, which can be included in the weight matrix:

[...]

We replace the summation operators between terms having different weight matrices with concatenation operators, having the final propagator function $\mathcal{P}$ as follows:

[...]

C.3.1 DA-GNN (LIU ET AL., 2020)

We assume that the initial node representation is created by a single linear layer of $XW$, where $W$ is a weight matrix. Then, the $k$-th representation matrix $H_k$ is represented as follows:

[...]

DA-GNN computes the weighted sum of representations for all $k \in [0, K]$, where the weight values are determined also from the representation matrices:

[...]

where $s$ is a learnable weight vector. We safely remove the last $W$ and rewrite $Ws$ as $w$. Then, we have the final representation of the propagator function:

[...]

C.3.2 GAT (VELICKOVIC ET AL., 2018)

We apply the following changes to linearize GAT, whose linearization is not straightforward due to the nonlinearity in the attention function:

1. We replace the activation functions between layers with the identity functions.
2. We simplify the attention function $\alpha_{ij} = \exp(e_{ij}) / \sum_k \exp(e_{ik})$ as $e_{ij}$.
3. We remove the LeakyReLU function in the computation of $e_{ij}$.
4. We assume the single-head attention.

The edge weight $e_{ij}$, which is the $(i, j)$-th element of the propagator matrix, is defined as follows:

[...]

where $x_i$ and $x_j$ are feature vectors of length $d$ for node $i$ and $j$, respectively, $W$ is a $d \times c$ learnable weight matrix, and $a_{\mathrm{dst}}$ and $a_{\mathrm{src}}$ are learnable weight vectors of length $c$. Then, we derive the initial form of a linearized GAT layer as follows:

[...]

Since all $a_{\mathrm{dst}}$, $a_{\mathrm{src}}$, and $W$ are free parameters, we generalize it as follows:

[...]

where $w_{\mathrm{dst}}$ and $w_{\mathrm{src}}$ are learnable vectors of length $m$ that replace $a_{\mathrm{dst}}$ and $a_{\mathrm{src}}$, respectively.

D DETAILS ON SANITY CHECKS

D.1 FORMAL DEFINITIONS OF GRAPH SCENARIOS

We categorize all possible scenarios of node classification based on the characteristics of node features X, a graph structure A, and node labels y. We denote by $A_{ij}$ and $Y_i$ the random variables for edge $(i, j)$ between nodes $i$ and $j$ and label $y_i$ of node $i$, respectively.

Edges

For the adjacency matrix A, we consider the following three cases:

• Uniform: [...]

The uniform case represents that every edge is determined independently of the others, and the graph structure gives no useful information for node classification. In the clustered case, it is more likely that nodes having common neighbors make more connections in a graph. In the bipartite case, nodes make more connections with those sharing no common neighbors.

Labels

For the labels in y, we consider the following three cases:

• Individual: [...]

The individual case means that the label of a node is independent of the labels of its neighbors. This is the case when the graph structure works as noise information with regard to classification. In the homophily case, adjacent nodes are likely to have the same label. It is the most popular assumption of GNNs for node classification. In the heterophily case, adjacent nodes are likely to have different labels, which is not as common as homophily but often observed in real-world graphs.

Features

For the feature matrix X, we consider the following three cases based on A and y. We use the notation of $p(\cdot)$ since the features are typically modeled as continuous variables.

• Random: $p(x_i,$ [...]

• w/ MLP-2: We replace LR with a 2-layer MLP.
• w/ MLP-3: We replace LR with a 3-layer MLP.
• w/ NL Trans.: We replace the PCA function $g(\cdot)$ with a nonlinear function. Specifically, we adopt a 2-layer MLP for the first two components, and a 2-layer GCN for the last two. The transformed features are concatenated and input into another 2-layer MLP. We use dropout with a probability 0.5 to prevent overfitting in both MLP and GCN.
The nonlinear models are trained with the same setting as GCN reported in Table 7. We report the result in Table 4, showing that adding nonlinearity does not necessarily improve the accuracy, while sacrificing both scalability and interpretability. We choose to use a linear function based on this result.

Receptive Fields

To test the effect of changing the receptive field of our SlenderGNN, we vary the distance of aggregation as in Table 8. $k_{\mathrm{row}}$ denotes the number of steps for $A_{\mathrm{row}}$, while $k_{\mathrm{sym}}$ denotes the number of steps for $\tilde{A}_{\mathrm{sym}}$. Since $A_{\mathrm{row}}$ is designed to consider heterophily relations, we use only the even values of $k_{\mathrm{row}}$. Table 8 shows that the first three places in each dataset are statistically tied in most of the cases. In other words, we have no significant gain by increasing the values of $k_{\mathrm{row}}$ and $k_{\mathrm{sym}}$, and thus the 2-step aggregation for both $A_{\mathrm{row}}$ and $\tilde{A}_{\mathrm{sym}}$ is good enough. To keep our model simple and general, we use $k_{\mathrm{row}} = 2$ and $k_{\mathrm{sym}} = 2$ in all experiments on synthetic and real data.

Components

We evaluate the accuracy of SlenderGNN when each of its core modules is disabled: the sparse regularization, PCA, and structural features U. SlenderGNN performs well with all these ideas in all 13 datasets, i.e., it is always included in the top two on each dataset. This shows that our SlenderGNN is designed effectively with ideas that help improve its performance.

