PROTOGNN: PROTOTYPE-ASSISTED MESSAGE PASS-ING FRAMEWORK FOR NON-HOMOPHILOUS GRAPHS

Abstract

Many well-known Graph Neural Network (GNN) models assume the underlying graphs are homophilous, where nodes share similar features and labels with their neighbours. They rely on message passing that iteratively aggregates neighbour's features and often suffer performance degradation on non-homophilous graphs where useful information is hardly available in the local neighbourhood. In addition, earlier studies show that in some cases, GNNs are even outperformed by Multi-Layer Perceptron, indicating insufficient exploitation of node feature information. Motivated by these two limitations, we propose ProtoGNN, a novel message passing framework that augments existing GNNs by effectively combining node features with structural information. ProtoGNN learns multiple class prototypes for each class from raw node features with the slot-attention mechanism. These prototype representations are then transferred onto the structural node features with explicit message passing to all non-training nodes irrespective of distance. This form of message passing, from training nodes to class prototypes to non-training nodes, also serves as a shortcut that bypasses local graph neighbourhoods and captures global information. ProtoGNN is a generic framework which can be applied onto any of the existing GNN backbones to improve node representations when node features are strong and local graph information is scarce. We demonstrate through extensive experiments that ProtoGNN brings performance improvement to various GNN backbones and achieves state-of-the-art performance on several non-homophilous datasets.

1. INTRODUCTION

Graph Neural Networks (GNNs) have emerged as prominent models for learning representations on graph-structured data. GNNs iteratively update the node embeddings based on the node features of its own and its local neighbors on the graph (Kipf & Welling, 2016; Defferrard et al., 2016; Gilmer et al., 2017) . This form of iterative message passing provides strong structural inductive bias that assumes the presence of useful information in local neighborhoods. This assumption holds true for homophilous graphs whose connected nodes tend to share similar features and labels. However, many real-world graphs are structured in a heterophilous way wherein nodes tend to be dissimilar with their local neighbors. Such non-homophilous graphs can be found in domains of fraud detection (Shi et al., 2022; Pandit et al., 2007) , molecular biology (Ye et al., 2022) and certain social networks (Lim et al., 2021) . Recent studies show that many GNNs (e.g. GCN (Kipf & Welling, 2016) ) fail to learn well on nonhomophilous graphs (Zhu et al., 2020b; 2021) , and are even outperformed by simple architectures such as Multi-Layer Perceptron (MLP) that ignore graph structures and only leverage node features. This indicates two limitations of these GNNs. First, information of node features is not sufficiently exploited and might be diluted by iterative message passing. Second, the prevailing structural inductive bias that utilizes local neighbourhood information may not be helpful on non-homophilous graphs, since useful information is often scarce within the local neighbourhood. Therefore, it is essential to go beyond the local neighborhood for useful information. Existing works either explicitly capture multi-hop information (Zhu et al., 2020b) or try to leverage globally available correlations with distant nodes (Suresh et al., 2021; Li et al., 2022) . However, such methods often go beyond the linear complexity of GNN message passing (Suresh et al., 2021) . More importantly, since relevant information is often scattered within the vast amount of distant nodes, identifying and extracting such information is a challenging task. To address the two limitations, we draw inspiration from prototypical networks (Snell et al., 2017) , an efficient and effective paradigm that constructs a single representation for each class called a prototype to assist learning in data-scarce settings. Learning node representations on graph structures using class prototypes can facilitate information flow from distant nodes and capture global correlations efficiently. We observe that in many datasets, nodes from the same class form multiple clusters (see Fig. 1(a) ). If we use a single prototype for each class, the prototype representation would be inaccurate or insufficient. For example, the single prototype may learn to represent the average of different clusters. Alternatively, the single prototype may learn to represent one of the dominant clusters in the feature space, therefore failing to capture information of the remaining clusters. Hence using a single prototype for each class may lead to suboptimal performance. In this work, we propose ProtoGNN, a novel message-passing framework that augments existing GNNs and overcomes two limitations of GNNs on heterophilous graphs by facilitating efficient information transfer from distant nodes. In ProtoGNN, we first disentangle the two sources of information from the graph in the form of node feature view and structure view. With the node features of the training nodes, we construct multiple prototypes from each class independent of graph structure by adapting the mechanism of slot-attention (Locatello et al., 2020) . These class prototypes are learnt from the node feature space via multiple rounds of attention. It provides an effective way of soft-clustering the node features as well as the needed model capacity for learning different clusters within the same class. Alternatively, this process can also be viewed as message passing from training nodes to prototype nodes as illustrated in Fig. 1(b ). The learned prototype representations are then transferred onto all node representations from the structure view to make them more discriminative. To enhance the effectiveness of ProtoGNN, we design two regularizers. To exploit the presence of the feature and structure views for regularization, we enforce cross-view compatibility between the prototypes (feature view) and node embeddings (structure view) through a hierarchical compatibility function which handles the presence of multiple prototypes within each class. To further exploit the full power of prototypes, we regularize them so that they are distinct from each other by encouraging the prototypes to be orthogonal to each other through an orthogonality loss. The learning of prototypes introduces artificial edges from training nodes of each class to all nontraining nodes via prototype nodes. This serves as a shortcut in message passing which bypasses local graph neighbourhoods and captures global information, while maintaining the linear complexity with respect to the backbone GNNs. Additionally, since the prototypes are learnt from node feature space, it preserves strong feature information that might be diluted by traditional message passing. Overall, ProtoGNN is a generic framework which presents the following advantages: (1) It is orthogonal to existing GNN backbones and can be applied onto any of them to improve node representations in heterophilous graphs. (2) It is efficient and can capture global node correlations with only O(n) additional edges. (3) It can be useful in graphs where distant information is helpful and node feature information is strong, even in homophilous settings when label rate is low. (4) It preserves the node features from dilution due to message passing. We conduct extensive experiments on several non-homophilous datasets which show consistent improvement over various backbone GNN models, achieving state-of-the-art performance in multiple instances.

2. RELATED WORKS

GNN models like GCN (Kipf & Welling, 2016) and GAT (Veličković et al., 2017) have performed profoundly well on graph-structured data in a multitude of domains. However, GCN-like models have an intrinsic assumption of homophily on graphs, and it has been noted by multiple works that they perform poorly on heterophilous graphs (Pei et al., 2020; Zhu et al., 2020b) . Early works tried to address this issue either through multi-hop neighbourhood aggregation or building auxiliary graph structures based on node and structural features. Since then, various ways have been proposed to make GNNs work better on non-homophilous graphs (Bo et al., 2021; Yan et al., 2021; Chien et al., 2020; Dong et al., 2021; Suresh et al., 2021; Yang et al., 2021; Lim et al., 2021; Luan et al., 2021; Zhu et al., 2020a; Liu et al., 2021) . Some of these methods leverage global information by letting the model choose either proximity or structural information or both for prediction (Suresh et al., 2021) , while other methods like GGCN (Yan et al., 2021) address this by introducing signed messages in the propagation scheme based on over-smoothing effect observed in GCN-like models. More recently, LINKX (Zhu et al., 2021) showed that a simple MLP applied on node features and adjacency matrix can perform well on many heterophilous graph datasets. In addition, GloGNN (Li et al., 2022) showed substantial improvements when further global information is captured. However, despite introducing an efficient approach to do this, it can take quadratic runtime when the graph is dense. In this work, we capture relevant information from distant nodes, which is often scattered within the graph structure. Identifying and extracting such information is a challenging task. Inspired by prototypical networks (Snell et al., 2017) , we construct class prototypes to enable efficient information transfer from distant nodes with linear complexity. Furthermore, with the observation that node features occur in several clusters in heterophilous graphs, we learn multiple prototypes for each class with an adapted form of slot-attention (Locatello et al., 2020) . There are also a few works that use prototype-based learning in graphs. One uses prototypes to make the predictions more explainable (Zhang et al., 2022) while another considers few-shot setting (Ding et al., 2020) . However, they do not integrate prototype learning with backbone GNNs, use multiple prototypes per class or consider heterophilous graphs. Very recently, Xu et al. (2022) proposed using memory units to capture global patterns in the graph. However, unlike Xu et al. (2022) , we train multiple prototypes per class and are inductively-biased with slot attention mechanism to capture node clusters.

3. PRELIMINARIES

Notations. Let G = (V, E) represent an undirected graph, where V = {v i } n i=1 is the set of nodes and E ⊆ V × V is the set of edges on the graph. A denotes the adjacency matrix of the graph, where A ij represents the edge e ij between nodes v i and v j . X represents the feature matrix of nodes, where X i is the features of node i, and X train and X nontrain are the feature matrices for all the training and non-training nodes, respectively. Y denotes the node class labels, and y i is the label for node v i . Homophily and heterophily Homophily refers to the degree of similarity between connected neighboring nodes in terms of their features or labels. There are many types of homophily measures proposed, including edge homophily (Zhu et al., 2020b) , node homophily (Pei et al., 2020) , and class insensitive homophily (Lim et al., 2021) . In this work, we use the homophily matrix proposed in Lim et al. (2021) , since it can better reflect class-wise homophily. The homophiliy matrix is defined as: H c1,c2 = |(u, v) ∈ E : c u = c 1 , c v = c 2 | |(u, v) ∈ E : c u = c 1 | , for classes c 1 and c 2 , H c1c2 denotes the proportion of edges between from nodes of class c 1 to nodes of class c 2 . A homophilous graph has high values on the diagonal entries.

4. PROTOGNN

With the aim of capturing scattered but useful information from distant nodes without diluting the strong node features, we propose ProtoGNN, which learns the node representations by decoupling the feature view and the structural view from the graph and fusing them to produce effective representations. ProtoGNN can be broken down into the following key components: (1) a feature view which involves learning multiple prototype representations for each class from only node features, (2) a structure view that consists of any GNN model, (3) information fusion which combines information from the two views via the attention mechanism. The whole pipeline is trained with three loss functions. In the following, we discuss the three stages in details. Feature View: In order to fully exploit the node features and be able to transfer information to distant nodes, we learn prototype representations of each class without the structural inductive bias. This forms the feature view of the graph, wherein nodes are clustered in feature space irrespective of their topological affinities. The prototype features can then be used to transfer information onto the structural representations for the model to learn the most effective embeddings adaptively. In real-world graphs, we observe that the feature distribution for each class of nodes is usually multi-modal, as illustrated by the feature space visualization in Fig. 1(a) . Using a single prototype to represent multi-modal distributions is insufficient and often inaccurate. Therefore, we design multiple prototype nodes for each class to enable them to capture different modes in feature distribution. The embeddings of these prototype nodes are learned via an adapted slot attention mechanism (Locatello et al., 2020) . In the original setting, a fixed number of slots are learnt to represent different objects in computer vision tasks. In ProtoGNN, we extend slot attention mechanism to learn the prototypes/clusters of node features of each class. The learning is done via attention where the class prototypes compete for explanations of each node, resulting in soft clustering effects such that each slot dominates in certain groups of nodes. For learning the prototypes, we first transform the node features using an MLP to generate the node embeddings. These embeddings are averaged and used to initialize the mean µ of a Gaussian distribution N (µ, Σ) with randomly initialized and learned co-variance Σ. We initialize K prototypes for each class c, denoted by P c ∈ R K×d , by sampling from the distribution N (µ, Σ). We then iteratively update the embeddings. In each iteration, we compute a dot product attention (Luong et al., 2015) between the prototypes and nodes, followed by a softmax operation that normalizes over the prototypes, such that the assignment scores to each prototype sum up to one for a given node. In this way, the prototypes will compete for an assignment of nodes which acts as a soft clustering on the node feature space. Specifically, the attention weights M c attn for class c is X = M LP (X raw ), M c attn = Softmax [P c W Q ] • [W K (X c train ) T ] √ d +ϵ ∈ R K×N c train , where d is the hidden dimension, N c train is the number of training nodes belonging to class c, W Q and W K are query and key transformation matrices respectively, and ϵ is a small coefficient added for numerical stability. Then, we normalize the attention matrix M c attn over the training node dimension, so that for each prototype, its assigned weights in all training nodes sum up to one. We then update the prototype embeddings P c using the weighted average of training node features X c train . P c = Norm(M c attn )X c train W V ∈ R K×d (4) where W V is a parameter matrix. The learning of prototypes serves as a soft clustering of feature space. It can also be viewed as message passing from training nodes to prototypes via artificially added edges. Structure View. We use GNN to learn the structure view representation of nodes. GNN takes both node features and adjacency matrix as inputs. After each layer of message passing or processing, GNN produces a structure view node embedding matrix H l ∈ R N ×d . Note that we can use any GNN as a backbone to generate the structure view embeddings. H l = GN N l (A, H l-1 ) where H 0 = X raw Information Transfer/Fusion. Once we have all the class prototypes P from the feature view and node embeddings H from the structure view, we can then transfer the information from prototype nodes onto the structure view embeddings. This is done by first computing dot product attention between prototypes from all classes and all non-training nodes, and then using the softmax function to normalize over the prototype dimension, so that the weight of all prototypes sum up to one for each non-training node. S attn = Softmax(H l nontrain P T ) ∈ R Nnontrain×CK (6) Next, we compute the message to be passed from prototype nodes to non-training nodes by taking the weighted average of prototype embeddings according to the dot product attention weights. Finally, we fuse the prototype messages from the feature view with the node embeddings from the structure view through an aggregation function. We allow the structure view of node embeddings to be taken from any layer of the GNN, making it an architectural choice. H l = AGG(H l , M proto ) where M proto = S attn P This information transfer process can also be viewed as a message passing from prototype nodes to non-training nodes. Therefore, the ProtoGNN framework essentially constructs a shortcut between all the nodes in the graph. This is especially beneficial in the non-homophilous setting where useful information is less available in the graph local neighborhood. This also helps in some homophilous settings as well when useful information is distant. Training. To exploit the full potential of the multiple prototypes of each class, we regularize training with two additional loss functions besides the cross-entropy loss. First, the cross-view compatibility loss regularizes by encouraging the structure and feature views to agree. Second, the orthogonality loss encourages the prototypes to be distinct from each other. We modify the compatibility loss function from Snell et al. (2017) , making it hierarchical in order to handle multiple prototypes within each class. This is done with the help of an additional aggregation function. We first compute the similarities between the prototypes of a class and a node embedding in order to compute the likelihood that the embedding belongs to the class. The similarity scores with the multiple prototypes are then aggregated into a single score for the class. Multiple choices of aggregators are possible, e.g. we could compute the average similarities between the embedding and the prototypes in the class. We use the max aggregator in order to select the most similar prototype within the class to the node. The compatibility loss is then computed as follows: s c i = max k∈K Λ(f (h L i ), p c k ) L comp = 1 CN train i∈N s ci i + log c ′ ̸ =c exp ( -s c ′ i ) where N train is the number of training nodes, C is number of classes, c i is the class-label of i th training node, p c k is the k th prototype of class c, h L i is the node embedding before final classification, f is an MLP and Λ is a similarity function for which we use dot product. To exploit the full power of prototypes, we regularize them so that they are distinct from each other. We do that by enforcing the prototypes to be orthogonal to each other through an orthogonality loss shown to be useful according to Bianchi et al. (2020) . L ortho = P c T P c ∥P c T P c ∥ F - I d √ d F ( ) where ∥•∥ F indicates the Frobenius norm. With the α and β as tuning hyperparameters, the full loss function used for training is L = L ce + αL comp + βL ortho (11) Table 1: comparison H2GCN WR-GAT GLOGNN PROTOGNN MEAN O(n 2 ) O(n log n) O(dCn) O(KCn) Complexity. In Tab. 1, we compare the complexity of ProtoGNN with other baseline models which add additional edges. Particularly, GloGNN (Li et al., 2022) , though efficient in sparse graphs with small d (average degree), can go quadratic for dense graphs. ProtoGNN captures global information with only O(n) additional edges. More specifically, it contains additional KCn virtual edges, when the graph contains C classes with K prototypes per class. Note that both K, C ≪ n and hence, it preserves the linear complexity of GNNs.

Datasets

We conduct experiments on 10 datasets, including 7 non-homophilous datasets: Penn94, Pokec, Genius, Twitch-gamers, Cornell5 and Amherst41 released by Lim et al. (2021) , and USelection dataset that we identify to be non-homophilous (Jia & Benson, 2020) . We use the original train/validation/test splits if such splits exist. The train/validation/test split is 60%, 20%, 20% for US-election, and 50%, 25%, 25% for other datasets. We also evaluate the performance of ProtoGNN on three popular homophilous datasets: Cora, Citeseer and Pubmed (Pei et al., 2020) , with a low label rate setting so that a labeled node is unlikely to be close to a node that needs to be labeled. The statistics of the datasets are included in Tab. 2. More details of the datasets are in the appendix. Baselines ProtoGNN is a generic framework that can be applied onto any of the existing GNN backbones. We first add ProtoGNN to 4 backbones: (1) GCN (Kipf & Welling, 2016); (2) GCN with batch normalization, as we observe that batch normalization significantly improves performance in some datasets; (3) GCNII (Chen et al., 2020) (Zhu et al., 2020b) , WRGAT (Suresh et al., 2021) , GPR-GNN (Chien et al., 2020) ACM-GCN (Luan et al., 2021) , LINKX and GloGNN (Li et al., 2022) . Table 2 : Statistics of non-homophilous datasets and Classification performance of ProtoGNN vs backbones on non-homophilous datasets. Hom. Mat. refers to the homophily matrix described in Sec. 3. 0 and 1 stand for class labels. We report AUC for Genius (Lim et al., 2021) and accuracy for other datasets with standard deviation over 3 trials. OOM refers to the out-of-memory error. 

5.2.1. USING PROTOGNN TO ENHANCE EXISTING GNN BACKBONES

We are motivated to design ProtoGNN as a generic framework that is orthogonal to existing GNN models to boost their performance on heterophilous graphs. We therefore evaluate the effects of applying ProtoGNN onto various existing GNN backbones. We first apply ProtoGNN on GCN, one of the most widely used general-purpose GNNs, and observe performance improvement on 5 out of 7 datasets, with substantial gains on Penn94, Pokec, Cornell5, and US-election (Tab. 2). During experiments on GCN, we find that batch normalisation brings consistent performance boost to GCN on various non-homophilous datasets. We therefore add ProtoGNN on top of GCN with batch normalization to evaluate its performance on this GCN variant. GCN BN +ProtoGNN achieves consistent performance improvement on all datasets except Genius. We then apply ProtoGNN on GCNII, a deeper GNN model that is capable of retrieving global information by propagation through the additional layers. Though not designed for heterophilous graphs, it serves as strong baseline for heterophilous datasets. GCNII+ProtoGNN achieves better performance than the original GCNII on all datasets, with pronounced improvement on Amherst41. We next evaluate our framework's effectiveness on a backbone explicitly designed for heterophilous graphs. LINKX+ProtoGNN performs consistently better than LINKX on 6 out of 7 datasets, with substantial enhancement on US-election. Penn94 and Amherst41 datasets. Overall, the results demonstrate consistent improvement of ProtoGNN on various backbone GNNs and datasets. ProtoGNN achieves state-of-the-art on 6 out of 7 datasets compared to all baselines, except on Pokec. While some methods incur OOM errors when running on large-scale datasets, ProtoGNN does not suffer OOM due to its linear complexity with respect to backbones. To further evaluate the effects of shortcut construction, we consider a generalized scenario on homophilous graphs where information from labeled nodes needs to propagate a long distance to reach most unlabeled nodes. Specifically, we evaluate our method on Cora, CiteSeer and Pubmed with only a small number of training labels per class. We use a single prototype for each class without slot attention and obtain prototype representations by taking the average of training node features, since the training labels are scarce. Our hypothesis is that information propagation becomes more challenging with increasing scarcity of useful training information, making the effects of shortcut construction more pronounced in sparsely-annotated graph datasets. The hypothesis is confirmed by experiment results. As shown in Tab. 4, GCN+ProtoGNN beats vanilla GCN on all 3 datasets, with substantial improvements on Cora and Pubmed. 

Number of prototypes per class

Motivated by the observation that node features tend to form clusters in feature space illustrated in Fig. 1 , we assign multiple prototypes per class to capture the clusters. Our hypothesis is that multi-modal feature distribution is more accurately represented using multiple prototypes. We test the hypothesis by varying the number of prototypes per class in ablative experiments. We train Pro-toGNN with combinations of different numbers of prototypes and hidden dimensions. Fig. 4 illustrates ablative experiments of LINKX+ProtoGNN on Penn94. We observe that: (1) Pro-toGNN with a single prototype per class outperforms the backbone. (2) ProtoGNN with multiple prototypes per class perform consistently better than their single prototype counterparts. Our ablative experiments shows the effectiveness of using prototypes, as well as the benefits of using multiple prototypes per class over using a single one. (3) The optimal number of slots depends on a number of factors, such as hidden dimension, dataset feature space and backbone model type. Loss functions To achieve the full potential of the soft-clustering objective of our model, we add two additional losses to learn prototypes effectively. The compatibility loss enforces similarity in embeddings, while the orthogonality loss acts as a regularizer constraining the model to learn distinct prototype clusters for each class. In this ablation study, we evaluate the effect of these losses on model performance. We vary α and β parameters in Eqn. 11 from the set {0, 0.01, 0.1}, with hidden dimension fixed to 64 and the number of slots set to 5. Results in Tab. 5 show consistent performance improvements for all non-zero values of α and β on both datasets, compared to cross-entropy loss alone. This provides substantial evidence for the usefulness of the additional loss functions. Aggregation operation in Compatibility loss In order to make the node embeddings similar to prototype representations from the same class, we design the cross-view compatibility loss L comp . After computing the similarity between a node and each of the multiple prototypes from the same class, the choice of aggregation method is critical. We evaluate the effectiveness of using the aggregation operator on Amherst41 and Cornell5 datasets. Tab. 6 shows the effects of replacing the max aggregator with mean and sum in computing compatibility loss. On both datasets, max outperforms both sum and mean, indicating the effectiveness of using max as the aggregation operation. Intuitively, taking the average of all similarity scores (mean) is sub-optimal. mean tends to make the node embeddings closer to the average of all prototypes from the same class, undermining the purpose of using multiple prototypes to represent different clusters. Similar to mean, summing up all similarity scores (sum) is more powerful yet requires more data to learn. max selects the maximum similarity score to compute compatibility loss and guides the node embeddings closer to one of the class prototypes, thus preserving the power of diversity in representation. Node multi-prototype learning. We adapt slotattention mechanism to learn multiple prototypes for each class. In order to evaluate the effectiveness of learning prototypes with slot-attention, we ablate the model against two variants: 1)ProtoGNN-Naive, a variant where class prototypes are constructed by taking the average of all training node features from that class. 2) ProtoGNN-SA, where we adopt the classic self attention mechanism to learn class prototypes. We show results on Amherst41 and Cornell5 datasets keeping other hyperparameters fixed in both models. As shown in Tab. 7, we observe that ProtoGNN consistently outperforms both variants, signifying the effectiveness of using slot attention. Its better performance may be attributed to the ability to perform soft clustering that enables the multiple prototypes per class to capture the underlying clusters in feature space, which self-attention and taking the average of node features cannot do. A detailed comparison between ProtoGNN-SA and ProtoGNN can be found in Appendix.

6. CONCLUSION

In this work, we address the two challenges in heterophilic graphs of efficiently capturing scattered but useful global information from distant nodes and preserving node feature information from dilution. For this, we propose ProtoGNN, a prototype-assisted message passing framework that augments existing GNNs by effectively combining node feature information with structural information. Cora, Citeseer and Pubmed (Pei et al., 2020) are citation graphs, where each node represents a scientific paper and two papers are connected when a paper cites the other. Each node is labeled with the research field and the task is to predict which field the paper belongs to. All three datasets are homophilous.

A.2 HYPERPARAMETE DETAILS

• MLP: hidden dimension ∈ {16, 32, 64}, number of layers ∈ {2, 3}. Activation function is ReLU. • GCN: lr ∈ {.01, .001}, hidden dimension ∈ {4, 8, 16, 32, 64}, except for Pokec, where hidden dimension ∈ {4, 8, 16}. number of layers ∈ {2}. Activation function is ReLU. • GAT: lr ∈ {.01, .001}. For Pokec: hidden channels ∈ {4, 8, 12} and gat heads ∈ {2, 4}. For the remaining datasets: hidden channels ∈ {4, 8, 12, 32} and gat heads ∈ {2, 4, 8}. number of layers ∈ {2}. We use the ELU as activation. • MixHop: hidden dimension ∈ {8, 16, 32}, number of layers ∈ {2}. • GCNII: number of layers ∈ {2, 8, 16, 32, 64}, strength of initial residual connection α ∈ {0.1, 0.2, 0.5}, hyperparameter for strength of the identity mapping θ ∈ {0.5, 1.0, 1.5}. • H 2 GCN: hidden dimension ∈ {16, 32}, dropout ∈ {0, .5}, number of layers ∈ {1, 2}. Model architecture follows Section 3.2 of Zhu et al. (2020b) . • WRGAT: lr ∈ {.01}, hidden dimension ∈ {32}. • GPR-GNN: lr ∈ {.01, .05, .002}, hidden dimension ∈ {16, 32}. • ACM-GCN: lr ∈ {.01}, weight decay ∈ {5e -5 , 5e -4 , 5e -3 }, dropout ∈ {0.1, 0.3, 0.5, 0.7, 0.9}, hidden channels ∈ {64}, number of layers ∈ {2}, display step ∈ {1}. • LINKX: hidden dimension ∈ {16, 32, 64}, number of layers ∈ {1, 2}. Rest of the hyperparameter settings follow Lim et al. (2021) . • GloGNN++: lr ∈ {.001, .005, .01} , weight decay ∈ {0, .01, .1}, dropout ∈ {0, .5, .8}, hidden channels ∈ {128, 256}, number of layers ∈ {1, 2}, α ∈ {0, 1}, β ∈ {0.1, 1}, γ ∈ {0.2, 0.5, 0.9}, δ ∈ {0.2, 0.5}, number of normalization layers ∈ {1, 2}, orders ∈ {1, 2, 3}, display step ∈ {1}. • ProtoGNN: lr ∈ {.005, .01}, weight decay ∈ {0, 0.001, .0005}, loss coefficients α, β ∈ {0, 0.01, 0.1}, prototypes per class (K) ∈ {1, 3, 5, 10}, hidden dimension ∈ {16, 32, 64}

A.3 TRAINING SETTINGS

We run each experiment with three different splits, and report the mean and standard deviation of the performance. All experiments are run on a single V100 GPU with 16G memory. We use Adam as the optimizer and perform grid search to find the best hyper-parameter combination, with details listed in A.2. We report the baseline results directly if they are already public.

A.4 FEATURE SPACE VISUALIZATION

To illustrate how node features form clusters in feature space, we plot the data points for 7 nonhomophilous datasets, with two shown in Fig. 1 . We reduce the node feature dimension to 2, sample 2000 nodes of the same class from each dataset, and plot them along the reduced dimensions. A.5 FURTHER ABLATION STUDY Self attention is a classic mechanism that can be used to learn the class prototypes. In this ablation study, we discuss in detail the rationality of using slot attention instead of self attention in our framework. In principle, slot attention is preferred over self attention because of the following reasons: Firstly, In slot attention, each slot/prototype will learn to compete for the explanation via the softmax-based attention mechanism. Therefore, the learning of multiple slots/prototypes serves as soft clustering of the training nodes. Self attention, on the other hand, does not act as soft clustering. As shown in Fig. 1 (a) and Fig. 6 , visualized features for Penn94 and Twitch Gamers do tend to form clusters in feature space. It is beneficial to be able to capture such clustering information, as illustrated in Figure 4 . Secondly, Slot attention is more efficient. In self attention, an attention score is computed for every pair of nodes, and thus the number of computations needed is O(n 2 ). However, in slot attention, we only need to compute the attention score between nodes and slots/prototypes,



Details of the datasets used can be found in Appendix A.1. For experiments, the settings can be found in Section 5.1 and details of the hyperparameter settings used for training are specified in Appendix A.2. To further facilitate reproducibility of this work, we will make the code publicly available at anonymised repository https://github.com/protognn/protognn.



Figure 1: (a) Visualization of features for Penn94 and Twitch Gamers datasets. The plots show clustering patterns of the nodes in the feature space. (b) Illustration of information transfer from training nodes to prototypes. Yellow and blue represent different classes. The prototype nodes learn multiple representations per class from the training nodes and transfer the information to non-training nodes through attention.

Figure 2: Overall architecture of ProtoGNN: ProtoGNN disentangles the input graph into two views, one considering only node features i.e. feature view, the other considering both graph structure and node features i.e.the structure view. In the feature view, ProtoGNN learns multiple prototype representations for each class from the node features. In the structure view, a backbone GNN generates node embeddings with the assistance of information from the feature view. After each GNN layer, generated node embeddings could be augmented by attending to prototypes. The whole pipeline is trained with a combination of three loss functions.

Figure 3: US 2012 election results map by county.An important property of ProtoGNN is introducing shortcuts between training nodes and non-training nodes via the constructed class prototypes. By facilitating the transfer of useful information from afar, shortcuts become especially helpful on heterophilous graphs, where useful information is often unavailable in local neighbourhoods. We use the US-election dataset to illustrate the effect of shortcuts. Here, nodes represent counties, and edges connect bordering counties. Red and blue colors represent the election outcomes of the counties. As visualized in Fig.3, nodes on the east coast and west coast are separated by a swathe of red counties, but share similar node features such as income, education and demographics. A shortcut between them via prototypes helps the message bypass the red counties in middle America, preventing message dilution in traditional message passing neural networks. Our results in Tab. 2 show that ProtoGNN achieves consistent performance improvement on US-election when added to all GNN backbones.

Figure 4: Effects of the number of prototypes per class(K).

Fig.5shows the t-sne plots of Pokec, Genius, Cornell5, Amherst41 and US-election datasets. The plots show the clustering patterns of the nodes in feature space consistently across the most of the datasets. This observation is the main motivation for our modelling of multiple prototypes per class for transferring global information across the nodes.

Figure 5: Feature space visualization of non-homophilous datasets.

Classification performance comparison against baselines. Notations follow style in Tab.2.

Classification accuracy on Cora, Citeseer and Pubmed with 5 labels per class.

Effects of Compatibility and Orthogonality losses on Amherst41 and Cornell5 datasets.

Effects of aggregator function used in Compatibility loss on Amherst41 and Cornell5 datasets.

Effect of Slot Attention on Amherst41

ProtoGNN exploits strong node features by constructing multiple class prototypes from training nodes. It then passes the information from prototype nodes to non-training nodes, updating their structural embeddings learned through GNN backbones. It thus creates shortcuts that bypass the local graph neighbourhoods and efficiently captures global information. ProtoGNN is a generic framework that can be applied to any existing GNN backbones. Our extensive experiments demonstrate that ProtoGNN brings consistent improvements when applied to various GNN backbones, achieving state-of-the-art results on multiple datasets against strong baselines.Lim et al., 2021) are friendship network datasets extracted from Facebook of students from selected universities from 2005. Each node in the datasets represent a student, while node label represents the reported gender of the student. Node features include major, second major/minor, dorm/house, year, and high school.Pokec(Lim et al., 2021) is a friendship graph dataset from a popular Slovak online social network, where nodes represent users and edges represent directed friendship relations. Node label represents the reported gender of user. Node features include geographical region, registration time, and age.Genius(Lim et al., 2021) is a sub-network from website genius.com, a crowd-sourced website of song lyrics annotations. Nodes represent users while edges connect users that follow each other. Node features include expertise scores expertise scores, counts of contributions and roles held by users. Around 20% of the users are marked with a "gone" label, indicating that they are more likely to be spam users. The task is to predict which users are marked. Twitch-gamers(Lim et al., 2021) is a sub-graph extracted from the streaming platform Twitch. Each node represents a Twitch account, while edges connect accounts that are mutual followers. Node features include number of views, creation and update dates, language, life time, and whether account is dead. The task is to predict whether the account has explicit content. US-election(Jia & Benson, 2020) is a geographical dataset extracted from statistics of Unite States election of year 2012. Nodes represent US counties, while edges connect bordering counties. Node features include income, education, population etc. The task is a binary classification to predict election outcome.

ETHICS STATEMENT

Our work is a generic framework to improve node representation learning on non-homophilous graphs. This work applies to GNNs generally and non-homophilous graph structured data specifically. Its ethical impacts (including positive and negative) depends on the specific domain of the data. We conduct our experiments on publicly available datasets in compliance with fair-use clauses. so the number of computations needed is O(nk), where k is the number of prototypes and is much smaller than the number of nodes n.We conduct an additional ablation study on self-attention vs slot attention to empirically verify our claims. We replace slot attention with self-attention in ProtoGNN and document the performance on 7 datasets in the following table. While ProtoGNN-SA with self-attention outperforms the backbone on most of the datasets, ProtoGNN with slot attention still outperforms it on all 7 datasets. To verify the computation efficiency empirically, we report the average runtime of ProtoGNN with self-attention and slot attention respectively. ProtoGNN with slot attention (10.14 milliseconds) runs much faster than ProtoGNN-SA (5.76 milliseconds).

A.6 HOMOPHILY MATRIX FOR HOMOPHILOUS DATASETS

Below are the homophily matrices for three well-known homophilous datasets: Cora, Citeseer and Pubmed (Yang et al., 2016) . High homophily is signified by the high numbers in diagonal cells, whereas values of non-diagonal cells are mostly less than 0.1. This is different from the homophily matrices of non-homophilous datasets in Tab. 2, where values of non-diagonal cells are similar or even higher than diagonal cells. 

