GLINKX: A SCALABLE UNIFIED FRAMEWORK FOR HOMOPHILOUS AND HETEROPHILOUS GRAPHS

Abstract

In graph learning, there have been two predominant inductive biases regarding graph-inspired architectures. On the one hand, higher-order interactions and message passing work well on homophilous graphs and are leveraged by GCNs and GATs. Such architectures, however, cannot easily scale to large real-world graphs. On the other hand, shallow (or node-level) models using ego features and adjacency embeddings work well on heterophilous graphs. In this work, we propose GLINKX, a novel scalable shallow method that works on both homophilous and heterophilous graphs. GLINKX leverages (i) novel monophilous label propagations, (ii) ego/node features, (iii) knowledge graph embeddings as positional embeddings, (iv) node-level training, and (v) low-dimensional message passing. Formally, we prove novel error bounds that justify the components of GLINKX. Experimentally, we show its effectiveness on several homophilous and heterophilous datasets.

1. INTRODUCTION

In recent years, graph learning methods have emerged with strong performance on various ML tasks. Graph ML methods leverage the topology of the graphs underlying the data (Battaglia et al., 2018) to improve their performance. In the context of node classification, two very important design options for graph ML architectures relate to whether the data is homophilous or heterophilous. For homophilous data, where neighboring nodes share similar labels (McPherson et al., 2001; Altenburger & Ugander, 2018a), Graph Neural Network (GNN)-based methods are able to achieve high accuracy. Specifically, a broad subclass of successful GNNs are Graph Convolutional Networks (GCNs) (e.g., GCN, GAT, etc.) (Kipf & Welling, 2016; Veličković et al., 2017; Zhu et al., 2020). In the GCN paradigm, message passing and higher-order interactions help node classification in the homophilous setting, since such inductive biases tend to bring the learned representations of linked nodes close to each other. However, GCN-based architectures suffer from scalability issues: performing (higher-order) propagations during the training stage is hard to scale to large graphs, because the number of participating nodes grows exponentially as the filter receptive field increases. Thus, for practical purposes, GCN-based methods require node sampling, substantially increasing their training time. For this reason, architectures (Huang et al., 2020; Zhang et al., 2022b; Sun et al., 2021; Maurya et al., 2021; Rossi et al., 2020) that perform propagations outside of the training loop (as a preprocessing step) have shown promising results in terms of scaling to large graphs. In heterophilous datasets (Rogers et al., 2014), connected nodes tend to have different labels. Many current works that address heterophily can be classified into two categories with respect to scale.
On the one hand, recent successful architectures (in terms of accuracy) (Jin et al., 2022a; Di Giovanni et al., 2022; Zheng et al., 2022b; Luan et al., 2021; Chien et al., 2020; Lei et al., 2022) that address heterophily resemble GCNs in design and thus suffer from the same scalability issues. On the other hand, shallow or node-level models (see, e.g., (Lim et al., 2021; Zhong et al., 2022)), i.e., models that treat graph data as tabular data and do not involve propagations during training, have shown a lot of promise for large heterophilous graphs. In (Lim et al., 2021), it is shown that combining ego embeddings (node features) and adjacency embeddings works in the heterophilous setting. One element that LINKX exploits via the adjacency embeddings is monophily (Altenburger & Ugander, 2018a; b), namely the similarity of the labels of a node's neighbors. However, this design is still impractical on real-world data, since the method (LINKX) is not inductive (see Section 2), and embedding the adjacency matrix directly requires many parameters. In LINKX, the adjacency embedding of a node can alternatively be thought of as a positional embedding (PE) of the node in the graph, and recent developments (Kim et al., 2022; Dwivedi et al., 2021; Lim et al., 2021) have shown the importance of PEs in both homophilous and heterophilous settings. However, most of these works suggest PE parametrizations that are difficult to compute in large-scale settings. We argue that PEs can be obtained in a scalable manner by utilizing knowledge graph embeddings, which, according to prior work (El-Kishky et al., 2022; Lerer et al., 2019; Bordes et al., 2013; Yang et al., 2014), can be trained on very large networks.
Goal & Contribution: In this work, we develop a scalable method for node classification that: (i) works both on homophilous and heterophilous graphs, (ii) is simpler and faster than conventional message-passing networks (by avoiding the neighbor-sampling and message-passing overhead during training), and (iii) can work in both a transductive and an inductive [1] setting. For a method to be scalable, we argue that it should: (i) run models at node scale (thus leveraging i.i.d. mini-batching), (ii) avoid message passing during training and perform it a constant number of times before training, and (iii) transmit small messages along the edges. Our proposed method, GLINKX (see Section 3), combines all the above desiderata. GLINKX has three components: (i) ego embeddings [2], (ii) PEs inspired by architectures suited for heterophilous settings, and (iii) scalable 2nd-hop-neighborhood propagations inspired by architectures suited for monophilous settings (Section 2.5). We prove novel theoretical error bounds that justify the components of our method (Section 3.4). Finally, we evaluate GLINKX's empirical effectiveness on several homophilous and heterophilous datasets (Section 4).

2.1. NOTATION

We denote scalars with lower-case letters, vectors with bold lower-case letters, and matrices with bold upper-case letters. We consider a directed graph G = G(V, E) with vertex set V with |V| = n nodes, edge set E with |E| = m edges, and adjacency matrix A. X ∈ R^{n×d_X} represents the d_X-dimensional node features, and P ∈ R^{n×d_P} represents the d_P-dimensional PE matrix. A node i has a feature vector x_i ∈ R^{d_X} and a positional embedding p_i ∈ R^{d_P}, and belongs to a class y_i ∈ {1, ..., c}. The training set is denoted by G_train(V_train, E_train), the validation set by G_valid(V_valid, E_valid), and the test set by G_test(V_test, E_test). I{·} is the indicator function. Δ_c is the c-dimensional simplex.

2.2. GRAPH CONVOLUTIONAL NEURAL NETWORKS

In homophilous datasets, GCN-based methods have been used for node classification. GCNs (Kipf & Welling, 2016) utilize feature propagations together with non-linearities to produce node embeddings. Specifically, a GCN consists of multiple layers, where layer i collects i-th-hop information from the nodes through propagations and forwards this information to the (i+1)-th layer. More specifically, if G has a symmetrically-normalized adjacency matrix A'_sym (with self-loops, ignoring the directionality of edges), then a GCN has the form

H^{(0)} = X,  H^{(i+1)} = σ(A'_sym H^{(i)} W^{(i)}) ∀i ∈ [L],  Y = softmax(H^{(L)}).

Here H^{(i)} is the embedding from the previous layer, W^{(i)} is a learnable projection matrix, and σ(·) is a non-linearity (e.g., ReLU, sigmoid, etc.).
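To make the layer recursion concrete, here is a minimal dense NumPy sketch of the forward pass (illustrative only: real GCN implementations use sparse operations and trained weights, and the function names here are ours):

```python
import numpy as np

def sym_norm_adj(A):
    """Symmetrically normalize an adjacency matrix with self-loops:
    A'_sym = D^{-1/2} (A + I) D^{-1/2}."""
    A_hat = A + np.eye(A.shape[0])
    d = A_hat.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    return D_inv_sqrt @ A_hat @ D_inv_sqrt

def gcn_forward(A, X, weights):
    """Stack of GCN layers: H^(0) = X, H^(i+1) = ReLU(A'_sym H^(i) W^(i)),
    followed by a row-wise softmax over the classes."""
    A_sym = sym_norm_adj(A)
    H = X
    for W in weights:
        H = np.maximum(A_sym @ H @ W, 0.0)  # ReLU non-linearity
    Z = np.exp(H - H.max(axis=1, keepdims=True))  # numerically stable softmax
    return Z / Z.sum(axis=1, keepdims=True)
```

Each additional weight matrix in `weights` enlarges the receptive field by one hop, which is exactly the scaling bottleneck discussed above.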

2.3. LINKX

In heterophilous datasets, the simple method of LINKX has been shown to perform well. LINKX combines two components, an MLP on the node features X and LINK regression (Altenburger & Ugander, 2018a) on the adjacency matrix, as follows: H_X = MLP_X(X), H_A = MLP_A(A), Y = ResNet(H_X, H_A).
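As a rough illustration of this combination, the following NumPy sketch mirrors the two branches and their residual combination (with randomly initialized weights; this is not the authors' implementation, and `mlp` and `linkx_forward` are hypothetical helper names):

```python
import numpy as np

def mlp(X, W1, W2):
    """Two-layer perceptron with ReLU (a stand-in for MLP_X / MLP_A)."""
    return np.maximum(X @ W1, 0.0) @ W2

def linkx_forward(X, A, params):
    """LINKX-style combination: H_X = MLP_X(X) on node features,
    H_A = MLP_A(A) on adjacency rows, then a residual combination of
    the two branches followed by a projection to class logits."""
    H_X = mlp(X, *params["mlp_x"])
    H_A = mlp(A, *params["mlp_a"])
    H = np.concatenate([H_X, H_A], axis=1) @ params["W_comb"]
    H = np.maximum(H + H_X + H_A, 0.0)  # residual connection over both branches
    return H @ params["W_out"]          # class logits
```

Note that `MLP_A` takes a full adjacency row per node, which is exactly why the parameter count grows with n and why the method is hard to apply inductively.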

2.4. NODE CLASSIFICATION

In node classification problems on graphs, we have a model f(X, Y_train, A; θ) that takes as input the node features X, the training labels Y_train, and the graph topology A, and produces a prediction for each node i of G, corresponding to the probability that the node belongs to each of the c classes (with these probabilities summing to one). The model is trained with back-propagation. Once trained, the model can be used to predict the labels of nodes in the test set. There are two training regimes: transductive and inductive. In the transductive regime, we have full knowledge of the graph topology (for the train, test, and validation sets) and the node features, and the task is to predict the labels of the validation and test sets. In the inductive regime, only the graph induced by V_train is known at training time, and the full graph is revealed for prediction on the validation and test sets. In real-world scenarios, such as online social networks, the dynamic nature of the problems makes the inductive regime particularly useful.

2.5. HOMOPHILY, HETEROPHILY & MONOPHILY

Homophily and Heterophily: There are various measures of homophily in the GNN literature, such as node homophily and edge homophily (Lim et al., 2021). Intuitively, homophily in a graph implies that nodes with similar labels are connected. GNN-based approaches such as GCN, GAT, etc., leverage this property to improve node classification performance. Alternatively, if a graph has low homophily, namely, nodes that connect tend to have different labels, it is said to be heterophilous. In other words, a graph is heterophilous if neighboring nodes do not share similar labels. Monophily: Generally, we define a graph to be monophilous if the label of a node is similar to those of its neighbors' neighbors [3]. Etymologically, the word "monophily" is derived from the Greek words "monos" (unique) and "philos" (friend), which in our context means that a node, regardless of its own label, has neighbors of primarily one label. In a directed graph, monophily can be thought of as a structure resembling Fig. 2(a), where similar nodes (in this case, three green nodes connected to a yellow node) are connected to a node with a different label. We argue that encoding monophily into a model can be helpful for both heterophilous and homophilous graphs (see Figs. 3(b) and 3(c)), which is one of the main motivators behind our work. In homophilous graphs, monophily fundamentally encodes the 2nd-hop neighbors' label information, and since neighboring nodes in such graphs have similar labels, it can provide a helpful signal for node classification. In heterophilous graphs, neighboring nodes have different labels, but the 2nd-hop neighbors may share the same label, again providing helpful information for node classification. Monophily has been shown to be effective for heterophilous graphs (Lim et al., 2021).
Therefore, an approach encoding monophily has an advantage over methods designed specifically for homophilous and heterophilous graphs, especially when varying levels of homophily can exist between different sub-regions in the same graph (see Section 3.3). It may also not be apparent if the (sub-)graph is purely homophilous/heterophilous (since these are not binary constructs), which makes a unified architecture that can leverage graph information for both settings all the more important.
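The contrast between edge homophily and 2nd-hop (monophilous) label agreement can be seen on a Fig. 2-style toy star graph with a small NumPy sketch (the function names and the 2nd-hop agreement measure below are our own illustrative choices, not definitions from the paper):

```python
import numpy as np

def edge_homophily(edges, y):
    """Fraction of edges whose endpoints share a label.
    Near 1 indicates homophily; near 0 indicates heterophily."""
    edges = np.asarray(edges)
    return float(np.mean(y[edges[:, 0]] == y[edges[:, 1]]))

def second_hop_agreement(edges, y, n):
    """Fraction of 2nd-hop pairs (two nodes sharing a common neighbor)
    that share a label: a rough proxy for monophily."""
    nbrs = [[] for _ in range(n)]
    for u, v in edges:
        nbrs[u].append(v)
        nbrs[v].append(u)
    same = total = 0
    for u in range(n):
        for a in nbrs[u]:
            for b in nbrs[u]:
                if a < b:
                    total += 1
                    same += int(y[a] == y[b])
    return same / total if total else float("nan")
```

On a star where a yellow hub (class 0) connects to three green leaves (class 1) and two leaves of other classes, edge homophily is 0 (a heterophilous graph), yet the 2nd-hop agreement is positive, which is exactly the signal MLaP exploits.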

3.1. COMPONENTS & MOTIVATION

The desiderata we laid out in Section 1 can be realized by three components: (i) PEs, (ii) ego embeddings, and (iii) label propagations that encode monophily. More specifically, the ego embeddings and PEs serve as the primary features of the models we end up training and have been shown to work for both homophilous and heterophilous graphs. Finally, the propagation step encodes monophily to provide additional information for our final prediction. Positional Embeddings: We use PEs to provide our model with information about the position of each node, and we hypothesize that PEs are an important piece of information in the context of large-scale node classification. PEs have been used to help discriminate isomorphic graph (sub)structures (Kim et al., 2022; Dwivedi et al., 2021; Srinivasan & Ribeiro, 2019). This is useful for both homophily (Kim et al., 2022; Dwivedi et al., 2021) and heterophily (Lim et al., 2021), because isomorphic (sub)structures can exist in both settings. In the homophilous case, adding positional information can help distinguish nodes that have the same neighborhood but distinct positions (Dwivedi et al., 2021; Morris et al., 2019; Xu et al., 2019), circumventing the need for higher-order propagations (Dwivedi et al., 2021; Li et al., 2019; Bresson & Laurent, 2017), which are prone to over-squashing (Alon & Yahav, 2021). In the heterophilous case, structural similarity among nodes is important for classification, as in LINKX, where the adjacency embedding can be considered a PE. However, in large graphs, using adjacency embeddings or Laplacian eigenvectors (as methods such as (Kim et al., 2022) suggest) can be a computational bottleneck and may be infeasible. In this work, we leverage knowledge graph embeddings (KGEs) to embed the graph and encode positional information about the nodes. Using KGEs has two benefits. Firstly, KGEs can be trained quickly on large graphs.
This is because KGEs compress the adjacency matrix into a fixed-size embedding, and adjacency matrices have been shown to be effective in heterophilous cases. Further, KGEs are lower-dimensional than the adjacency matrix (e.g., d_P ∼ 10²), allowing for faster training and inference. Secondly, KGEs can be pre-trained efficiently on such graphs (Lerer et al., 2019) and used off-the-shelf for other downstream tasks, including node classification (El-Kishky et al., 2022) [4]. So, in the 1st Stage of our method in Alg. 1 (Fig. 1(a)), we train a KGE model on the available graph structure. We fix these positional encodings once they are pre-trained for downstream usage. Finally, we note that this step is transductive, but we can easily make it inductive (El-Kishky et al., 2022; Albooyeh et al., 2020). Ego Embeddings: We obtain ego embeddings from the node features. Such embeddings have been used in both homophilous and heterophilous settings (Lim et al., 2021; Zhu et al., 2020). Node embeddings are useful for tasks where the graph structure provides little or no information about the task.

Monophilous Label Propagations:

We now propose a novel label propagation inspired by monophily (see Section 2.5), which we refer to as Monophilous Label Propagation (MLaP). MLaP has the advantage that we can use it both for homophilous and heterophilous graphs, or in scenarios with varying levels of graph homophily (see Section 3.3), as it encodes monophily (Section 2.5). To understand how MLaP encodes monophily, consider the example in Fig. 2: three green nodes and two nodes of different colors are connected to a yellow node. One way to encode monophily in Fig. 2(a) while predicting the label of j_ℓ, ℓ ∈ [5], is to form the distribution of labels of the nodes connected to node i, thus encoding its neighbors' label distribution. The fact that there are more green nodes than nodes of other colors can then be used by the model to make a prediction. However, this information may not always be present, or there may be few labeled nodes around node i. Consequently, we propose to use a model that predicts the label distribution of the nodes connected to i. We build this model from the node features x_i and PE p_i of node i, since the nodes connected to node i share similar labels and, thus, the features of node i should be predictive of its neighbors' labels. So, in Fig. 2(a), we train a model to predict the distribution of i's neighbors. Next, we provide j_ℓ with the learned distribution of i's neighbors by propagating the learned distribution from i back to j_ℓ. Eqs. (1) to (3) correspond to MLaP. We then train a final model that leverages this information together with the node features and PEs (Fig. 2(b)).

3.2. OUR METHOD: GLINKX

We put the components discussed in Section 3.1 together into three stages. In the first stage, we pre-train the PEs using KGEs. Next, we encode monophily into our model by training a model that predicts a node's neighbors' label distribution and by propagating the soft labels from the fitted model. Finally, we combine the propagated information, the node features, and the PEs to train a final model. GLINKX is described in Alg. 1, and its three main stages are detailed as block diagrams in Fig. 1. In the 2nd Stage (MLaP; Fig. 1(b) and Fig. 2(a)), for each node we want to learn the distribution of its neighbors' labels. To achieve this, we first propagate the labels from a node's neighbors (we call this step MLaP Forward), i.e., we calculate

ŷ_i = ( Σ_{j ∈ V_train : (j,i) ∈ E_train} y_j ) / |{ j ∈ V_train : (j,i) ∈ E_train }|   ∀ i ∈ V_train.   (1)

Then, we train a shallow model that takes i's features x_i ∈ R^{d_X} and PEs p_i ∈ R^{d_P} and predicts a value ỹ_i ∈ R^c that matches the label distribution ŷ_i of its neighbors. Concretely, we maximize the negative cross-entropy, treating {ŷ_i}_{i ∈ V_train} as ground-truth labels:

L_{CE,1}(θ_1) = Σ_{i ∈ V_train} Σ_{l ∈ [c]} ŷ_{i,l} log(ỹ_{i,l}),   (2)

where ỹ_i = f_1(x_i, p_i; θ_1) and θ_1 ∈ Θ_1 is a learnable parameter vector. Although in this paper we assume the transductive setting, this step allows us to be inductive (see App. B).
In Section 3.4, we give a theoretical justification of this step, namely, "why is it good to use a parametric model to predict the distribution of neighbors?". Finally, we propagate (outside the training loop) the predicted soft labels ỹ_i back to the original nodes (we call this step MLaP Backward), i.e., we calculate

y'_i = ( Σ_{j ∈ V : (i,j) ∈ E} ỹ_j ) / |{ j ∈ V : (i,j) ∈ E }|   ∀ i ∈ V_train,   (3)

where the soft labels {ỹ_i}_{i ∈ V_train} have been computed with the parameters θ*_1 of the epoch with the best validation accuracy of model f_1(·; θ_1). 3rd Stage (Final Model): We make the final predictions y_{final,i} = f_2(x_i, p_i, y'_i; θ_2) by combining the ego embeddings, the PEs, and the back-propagated soft labels (θ_2 is a learnable parameter vector). We use the soft labels ỹ_i instead of the one-hot actual labels one_hot(y_i) in order to avoid label leakage, which hurts performance (see also (Shi et al., 2020) for a different way to combat label leakage). Finally, we maximize the negative cross-entropy with respect to a node's own labels,

L_{CE,2}(θ_2) = Σ_{i ∈ V_train} Σ_{l ∈ [c]} I{y_i = l} log(y_{final,i,l}).

Overall, Stage 2 corresponds to learning the neighbor distributions and propagating these distributions, and Stage 3 uses them to train a new model that predicts a node's own labels. In Section 3.4, we prove that such a two-step procedure incurs lower errors than directly using the features to predict a node's labels. Scalability: GLINKX is highly scalable, as it performs message passing a constant number of times, paying an O(mc) cost, where the number of classes c is usually small (compared to d_X, which GCNs rely on). In both Stages 2 and 3 of Alg. 1, we train node-level MLPs, which allows us to leverage i.i.d. (row-wise) mini-batching, as with tabular data; thus, our complexity is similar to other shallow methods (LINKX, FSGNN) (Lim et al., 2021; Maurya et al., 2021). This, combined with performing the propagations outside the training loop, circumvents the scalability issues of GCNs. For more details, refer to App. A.3.
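The MLaP Forward and Backward propagations of Eqs. (1) and (3) can be sketched directly in NumPy (a didactic dense version: the function names are ours, and a practical implementation would use sparse scatter operations):

```python
import numpy as np

def mlap_forward(edges, y_train, train_mask, c):
    """Eq. (1): for each training node i, average the one-hot labels of its
    training in-neighbors j with (j, i) in E_train to obtain y_hat_i."""
    n = len(train_mask)
    one_hot = np.zeros((n, c))
    one_hot[np.arange(n)[train_mask], y_train[train_mask]] = 1.0
    y_hat, deg = np.zeros((n, c)), np.zeros(n)
    for j, i in edges:
        if train_mask[j] and train_mask[i]:
            y_hat[i] += one_hot[j]
            deg[i] += 1
    nz = deg > 0
    y_hat[nz] /= deg[nz, None]
    return y_hat

def mlap_backward(edges, y_tilde):
    """Eq. (3): propagate the predicted soft labels y_tilde_j back along
    out-edges, averaging over {j : (i, j) in E} to obtain y'_i."""
    n, c = y_tilde.shape
    y_prime, deg = np.zeros((n, c)), np.zeros(n)
    for i, j in edges:
        y_prime[i] += y_tilde[j]
        deg[i] += 1
    nz = deg > 0
    y_prime[nz] /= deg[nz, None]
    return y_prime
```

Between the two calls, Stage 2 fits f_1 on (x_i, p_i) against the `mlap_forward` targets; only the c-dimensional soft labels travel along edges, which is the O(mc) cost noted above.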

3.3. VARYING HOMOPHILY

Graphs with monophily can exhibit homophily, heterophily, or both. For instance, in the yelp-chi dataset, where we classify a review as spam/non-spam (see Fig. 3), we observe monophily together with varying homophily. Specifically, in this dataset, spam reviews link to non-spam reviews, and non-spam reviews usually connect to other non-spam reviews, which makes the node homophily distribution bimodal. Here, the 2nd-order similarity makes the MLaP mechanism effective.

3.4. THEORETICAL JUSTIFICATION

Justification of Stage 2:

In Stage 2, we train a parametric model to learn the distribution of a node's neighbors from the node features ξ_i [5]. Arguably, we could learn such a distribution naïvely by counting the neighbors of i that belong to each class. This motivates our first theoretical result. In summary, we show that training a parametric model for learning the distribution of a node's neighbors (as in Stage 2) yields a lower error than the naïve solution. Below we present Thm. 1 (proof in App. F) for undirected graphs (the case of directed graphs is the same, but we omit it for simplicity of exposition):

Theorem 1. Let G([n], E) be an undirected graph of minimum degree K > c², and let Q_i ∈ Δ_c be the likelihood, from the viewpoint of node i, of any node in its neighborhood N(i) to be assigned to the different classes, for every node i ∈ [n]. The following two facts are true (under standard assumptions for SGD and the losses):

1. Let Q̂_i be the sample average of Q_i, i.e., Q̂_{i,j} = (1/|N(i)|) Σ_{k ∈ N(i)} I{y_k = j}. Then, for every i ∈ [n], we have that max_{j ∈ [c]} E[|Q_{i,j} − Q̂_{i,j}|] ≤ E[‖Q_i − Q̂_i‖_∞] ≤ O(√(log(Kc)/K)).

2. Let q(·|ξ_i; θ) be a model parametrized by θ ∈ R^D that uses the features ξ_i of each node i to predict Q_i. We estimate the parameter θ_1 by running SGD for t = n steps to maximize L(θ) = (1/n) Σ_{i=1}^n Σ_{j=1}^c Q_{i,j} log q(j|ξ_i; θ). Then, for every i ∈ [n], we have that max_{j ∈ [c]} E[|q(j|ξ_i; θ_1) − Q_{i,j}|] ≤ O(√(log n / n)).

It is evident that if the minimum degree K is much smaller than n, then the parametric model has lower error than the naïve approach, namely Õ(n^{−1/2}) compared to Õ(K^{−1/2}).

Justification of Stages 2 and 3: We now provide theoretical foundations for the two-stage approach. Specifically, we argue that a two-stage procedure, which first learns a node's 2nd-hop neighbor distributions (assuming again, for simplicity, that the graph is undirected) with a parametric model as in Thm. 1, and then runs a two-phase algorithm to learn a parametric model that predicts a node's label, yields a lower error than naïvely training a shallow parametric model to learn a node's labels. The first phase of the two-phase algorithm trains the model by minimizing the cross-entropy between the predictions and the 2nd-hop neighborhood distributions. The second phase then trains a joint objective that uses the learned neighbor distributions and the actual labels, starting from the model learned in the first phase.

Theorem 2. Let G([n], E) be an undirected graph of minimum degree K > c², let P_i be the likelihood of node i to be assigned to the different classes, and let Q_i and q(·|ξ_i; θ_1) be defined as in Thm. 1. Let p(·|ξ_i; w) be a model parametrized by w ∈ R^D that is used to predict the class assignments y_i ∼ p(·|ξ_i; w), and let w* be the optimal parameter. The following are true (under standard assumptions for SGD and the losses):

1. The naïve optimization scheme that runs SGD to maximize G(w) = (1/n) Σ_{i=1}^n Σ_{j=1}^c P_{i,j} log p(j|ξ_i; w) for n steps has error E[G(w*) − G(w_{n+1})] ≤ O(log n / n).

2. The two-phase optimization scheme that runs SGD to maximize Ĝ(w) = (1/n) Σ_{i=1}^n Σ_{j=1}^c (1/|N(i)|) Σ_{k ∈ N(i)} q(j|ξ_k; θ_1) log p(j|ξ_i; w) for n_1 steps to estimate a solution w_0, and then runs SGD on the objective λĜ(w) + (1 − λ)G(w) for n steps starting from w_0, achieves error E[G(w*) − G(w_{n+1})] ≤ O(√(log n) · log log n / n).

You can find the proof in App. F. We observe that the two-phase optimization scheme reduces the error by a factor of √(log n)/log log n, highlighting the importance of using the distribution of a node's 2nd-hop neighbors to predict its label; this holds regardless of the homophily properties of the graph. Also, note that the above two-phase optimization scheme differs slightly from the description of the method in Alg. 1.
The difference is that in Alg. 1 the distribution of a node's neighbors is embedded into the model, whereas in Thm. 2 it is embedded into the loss function as a regularizer. In Alg. 1, we chose to incorporate this information into the model because using multiple losses harms scalability and makes training harder in practice. In the same spirit, the conception of GCNs (Kipf & Welling, 2016) replaced explicit graph-Laplacian regularization with building the topology into the model (see also (Hamilton et al., 2017; Yang et al., 2016)).
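To get intuition for the Õ(K^{−1/2}) term in Thm. 1, the following simulation (our own illustrative sketch, with an arbitrarily chosen Q) measures the average L∞ error of the sample-average estimator Q̂ as the neighborhood size K grows:

```python
import numpy as np

rng = np.random.default_rng(1)

def empirical_error(K, Q, trials=200):
    """Average L_inf error of the sample-average estimator Q_hat of Thm. 1
    when a node has K neighbors whose labels are drawn i.i.d. from Q."""
    c = len(Q)
    errs = []
    for _ in range(trials):
        labels = rng.choice(c, size=K, p=Q)
        Q_hat = np.bincount(labels, minlength=c) / K
        errs.append(np.abs(Q - Q_hat).max())
    return float(np.mean(errs))

Q = np.array([0.5, 0.3, 0.2])  # an arbitrary neighbor-label distribution
err_small = empirical_error(10, Q)    # low-degree node: noisy estimate
err_large = empirical_error(1000, Q)  # high-degree node: concentrated estimate
```

The error shrinks roughly like K^{−1/2}, so low-degree nodes get poor naïve estimates, which is precisely when the parametric model of Stage 2 helps.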

3.5. COMPLEMENTARITY

Different components of GLINKX provide signals complementary to components proposed in the GNN literature (Maurya et al., 2021; Zhang et al., 2022b; Rossi et al., 2020). One can combine GLINKX with existing architectures (e.g., feature propagations (Maurya et al., 2021; Rossi et al., 2020), label propagations (Zhang et al., 2022b)) for potential metric gains. For example, SIGN computes a series of r ∈ N feature propagations [X, ΩX, Ω²X, ..., Ω^r X], where Ω is a matrix (e.g., the normalized adjacency or the normalized Laplacian), as a preprocessing step. We can include this complementary signal in GLINKX, namely, embed each of the propagated features and combine them in the 3rd Stage. Overall, although in this paper we keep GLINKX simple to highlight its main components, we conjecture that adding more components to GLINKX would improve its performance on datasets with highly variable homophily (see Section 3.3).
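The SIGN-style preprocessing described above can be sketched as follows (assuming, for illustration, that Ω is the row-normalized adjacency; `sign_features` is a hypothetical helper name):

```python
import numpy as np

def sign_features(A, X, r):
    """Precompute SIGN-style operator powers [X, ΩX, Ω²X, ..., Ω^r X],
    with Ω taken here to be the row-normalized adjacency (one possible
    choice; SIGN also admits normalized Laplacians and other operators)."""
    deg = A.sum(axis=1, keepdims=True)
    Omega = np.divide(A, deg, out=np.zeros_like(A), where=deg > 0)
    feats, H = [X], X
    for _ in range(r):
        H = Omega @ H  # one extra hop of propagation per power
        feats.append(H)
    return feats  # list of r + 1 feature matrices, each n x d_X
```

Because these matrices are computed once before training, embedding each of them in the 3rd Stage adds signal without reintroducing message passing into the training loop.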

4. EXPERIMENTS & CONCLUSIONS

Comparisons: We experiment with homophilous and heterophilous datasets (see Tab. 1 and App. D.3). We train KGEs with Pytorch-Biggraph (Lerer et al., 2019; Yang et al., 2014). For homophilous datasets, we compare with vanilla GCN and GAT, FSGNN, and Label Propagation (LP). For a fair comparison, we compare with one-layer GCN/GAT/FSGNN/LP, since our method is one-hop. We also compare with higher-order (h.o.) GCN/GAT/FSGNN/LP with 2 and 3 layers. In the heterophilous case, we compare with LINKX [6], because it is scalable and has been shown to work better than other baselines (e.g., H2GCN, MixHop, etc.), and with FSGNN. Note that we do not compare GLINKX with other, more complex methods, because GLINKX is complementary to them (see Section 3.5), and we can incorporate their designs into GLINKX. We use a ResNet module to combine our algorithm's components from Stages 2 and 3. Details about the hyperparameters we use are in App. C. In the heterophilous datasets, GLINKX outperforms LINKX (except for arxiv-year, where we are within the confidence interval). Moreover, the performance gap between using KGEs and adjacency embeddings shrinks as the dataset grows. In the homophilous datasets, GLINKX outperforms 1-layer GCN/GAT/LP/FSGNN and LINKX. In PubMed, GLINKX beats h.o. GCN/GAT, and in arxiv-year GLINKX is very close to the performance of GCN/GAT. Finally, we note that our method produces consistent results across regime shifts. In detail, in the heterophilous regime, our method performs on par with LINKX; however, when we shift to the homophilous regime, LINKX's performance drops, whereas our method's performance remains high. Similarly, while FSGNN performs similarly to GLINKX on the homophilous datasets, we observe a significant performance drop on the heterophilous datasets (see arxiv-year). Ablation Study: We ablate each component of Alg. 1 to measure its contribution to performance. We use the hyperparameters of the best model from Tab. 1.
We perform two types of ablations: (i) we remove each of the components from all stages of training, and (ii) we remove the corresponding components only from the 3rd Stage. Except for removing the PEs from the 3rd Stage only on ogbn-arxiv, all components contribute to increased performance on both datasets. Note that adding PEs in the 1st Stage does improve performance, suggesting that this is the primary use case of PEs.

Conclusion:

We present GLINKX, a scalable method for node classification on homophilous and heterophilous graphs that combines three components: (i) ego embeddings, (ii) PEs, and (iii) monophilous propagations. As future work, one can (i) extend GLINKX to heterogeneous graphs, (ii) use more expressive methods, such as attention or Wasserstein barycenters (Cuturi & Doucet, 2014), for averaging the low-dimensional messages, and (iii) add complementary signals.



Footnotes:
[1] For this paper, we operate in the transductive setting. See App. B for the inductive setting.
[2] We use ego embeddings and node features interchangeably.
[3] A similar definition of monophily has appeared in (Altenburger & Ugander, 2018a), whereby many nodes have extreme preferences for connecting to a certain class.
[4] Positional information can also be provided by other methods, such as node2vec (Grover & Leskovec, 2016); however, most such methods are less scalable.
[5] In Section 3.1, the ξ_i correspond to the augmented features ξ_i = [x_i; p_i].
[6] We have run our method with a hyperparameter space that is a subset of the sweeps reported in (Lim et al., 2021), due to resource constraints. A bigger hyperparameter search would improve our results.



Figure 1: Block Diagrams of GLINKX stages.

Fig. 2 shows the GLINKX stages from Alg. 1 on a toy graph. 1st Stage (KGEs): We train DistMult KGEs with Pytorch-Biggraph (Yang et al., 2014), treating G as a knowledge graph with only one relation (see App. A.4 for more details). Here we have decided to use DistMult, but one can use any embedding method of choice to embed the graph.

Figure 2: Example. For node i, we want to learn a shallow model that takes i's features x_i ∈ R^{d_X} and PEs p_i ∈ R^{d_P} and predicts a value ỹ_i ∈ R^c that matches the label distribution ŷ_i of its neighbors. Next, we propagate (outside the training loop) the predicted distribution of a node back to its neighbors and use it, together with the ego features and the PEs, to make a prediction about a node's own label. We propagate ỹ_i to its neighbors j_1 to j_5. For example, for j_1, we encode the propagated distribution estimate ỹ_i from i to form y'_{j_1}. We predict the label by using y'_{j_1}, x_{j_1}, p_{j_1}.

Figure 3: Top: Node and class homophily distributions for the yelp-chi dataset. Bottom: Examples of a homophilous (Fig. 3(b)) and a heterophilous (Fig. 3(c)) region in the same graph that are both monophilous, namely, nodes are connected to many neighbors of the same kind. In a spam network, the homophilous region corresponds to many non-spam reviews connecting to non-spam reviews (the expected behaviour of a non-spammer user), and the heterophilous region corresponds to spam reviews targeting non-spam reviews (the expected behaviour of spammers), yielding a graph with both homophilous and heterophilous regions, as in Fig. 3(a).

Algorithm 1: GLINKX. Input: graph G(V, E) with train set V_train ⊆ V, node features X, labels Y. Output: node label predictions Y_final. 1st Stage (KGEs): pre-train knowledge graph embeddings P with Pytorch-Biggraph. 2nd Stage (MLaP): compute ŷ_i via MLaP Forward (Eq. (1)), train f_1 to predict the neighbor label distributions, and propagate the soft labels back via MLaP Backward (Eq. (3)). 3rd Stage (Final Model): train f_2 on the ego features, PEs, and propagated soft labels.

Experimental results. (*) = results from the OGB leaderboard.

Ablation Study. We use the hyperparameters of the best run from Tab. 1 with KGEs.

