GLINKX: A SCALABLE UNIFIED FRAMEWORK FOR HOMOPHILOUS AND HETEROPHILOUS GRAPHS

Abstract

In graph learning, there have been two predominant inductive biases regarding graph-inspired architectures. On the one hand, higher-order interactions and message passing work well on homophilous graphs and are leveraged by GCNs and GATs. Such architectures, however, cannot easily scale to large real-world graphs. On the other hand, shallow (or node-level) models using ego features and adjacency embeddings work well on heterophilous graphs. In this work, we propose GLINKX, a novel scalable shallow method that works on both homophilous and heterophilous graphs. GLINKX leverages (i) novel monophilous label propagations, (ii) ego/node features, (iii) knowledge graph embeddings as positional embeddings, (iv) node-level training, and (v) low-dimensional message passing. Formally, we prove novel error bounds that justify the components of GLINKX. Experimentally, we show its effectiveness on several homophilous and heterophilous datasets.

1. INTRODUCTION

In recent years, graph learning methods have emerged with strong performance on various ML tasks. Graph ML methods leverage the topology of the graph underlying the data (Battaglia et al., 2018) to improve their performance. In the context of node classification, a key design consideration for graph-ML-based architectures is whether the data is homophilous or heterophilous. For homophilous data, where neighboring nodes share similar labels (McPherson et al., 2001; Altenburger & Ugander, 2018a), Graph Neural Network (GNN)-based methods are able to achieve high accuracy. Specifically, a broad subclass of successful GNNs are Graph Convolutional Networks (GCNs) (e.g., GCN, GAT, etc.) (Kipf & Welling, 2016; Veličković et al., 2017; Zhu et al., 2020). In the GCN paradigm, message passing and higher-order interactions help node classification in the homophilous setting, since such inductive biases tend to bring the learned representations of linked nodes close to each other. However, GCN-based architectures suffer from scalability issues. Performing (higher-order) propagations during the training stage is hard to scale to large graphs because the number of neighbors grows exponentially with the receptive field of the filter. Thus, for practical purposes, GCN-based methods require node sampling, which substantially increases their training time. For this reason, architectures (Huang et al., 2020; Zhang et al., 2022b; Sun et al., 2021; Maurya et al., 2021; Rossi et al., 2020) that perform propagations outside of the training loop (as a preprocessing step) have shown promising results in terms of scaling to large graphs.

In heterophilous datasets (Rogers et al., 2014), connected nodes tend to have different labels. Many works that address heterophily can be classified into two categories with respect to scale.
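To make the preprocessing idea concrete, the following is a minimal sketch (not the paper's method) of SIGN/SGC-style precomputed propagations: powers of a normalized adjacency are applied to the feature matrix once, up front, so that training can proceed on the resulting tabular features without neighbor sampling. The function name and the dense-numpy setting are illustrative assumptions; real implementations use sparse matrices.

```python
import numpy as np

def precompute_propagations(adj: np.ndarray, feats: np.ndarray, k: int):
    """Return [X, AX, A^2 X, ..., A^k X] using a row-normalized adjacency.

    Because these matrices are computed once as preprocessing, a downstream
    node-level (tabular) model needs no message passing during training.
    """
    deg = adj.sum(axis=1, keepdims=True)
    a_norm = adj / np.maximum(deg, 1.0)   # row-normalized adjacency (D^-1 A)
    out, cur = [feats], feats
    for _ in range(k):
        cur = a_norm @ cur                # one propagation step
        out.append(cur)
    return out

# Toy example: a 4-node path graph with one-hot features.
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
X = np.eye(4)
hops = precompute_propagations(A, X, k=2)
print(len(hops))  # 3 matrices: X, AX, A^2 X
```

Each row of `hops[i]` is a k-hop-averaged feature vector for one node and can be concatenated into a fixed-size input for any tabular classifier.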
On the one hand, recent architectures that are successful in terms of accuracy (Jin et al., 2022a; Di Giovanni et al., 2022; Zheng et al., 2022b; Luan et al., 2021; Chien et al., 2020; Lei et al., 2022) resemble GCNs in design and thus suffer from the same scalability issues. On the other hand, shallow or node-level models (see, e.g., (Lim et al., 2021; Zhong et al., 2022)), i.e., models that treat graph data as tabular data and do not involve propagations during training, have shown a lot of promise for large heterophilous graphs. In (Lim et al., 2021), it is shown that combining ego embeddings (node features) and adjacency embeddings works well in the heterophilous setting. One element that LINKX exploits via the adjacency embeddings is monophily (Altenburger & Ugander, 2018a;b), namely the similarity of the labels of a node's neighbors. However, their design is still impractical for real-world data since the method (LINKX) is not inductive (see Section 2),
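The combination of ego features and adjacency embeddings can be sketched as follows. This is a simplified, hedged illustration of the LINKX idea, not its actual architecture: the random projection matrices `W_adj` and `W_ego` stand in for learned MLPs, and all names are hypothetical. Note that because `W_adj` has one row per node, the representation is tied to a fixed node set, which is one way to see why such a design is not inductive.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, h = 5, 3, 4                          # nodes, feature dim, hidden dim

# Toy graph and node features.
A = (rng.random((n, n)) < 0.4).astype(float)
np.fill_diagonal(A, 0)
X = rng.standard_normal((n, d))

# Stand-ins for learned weights (a real model trains MLPs here).
W_adj = rng.standard_normal((n, h))        # embeds each node's adjacency row
W_ego = rng.standard_normal((d, h))        # embeds the node's own features

h_adj = A @ W_adj                          # adjacency ("LINK") embedding
h_ego = X @ W_ego                          # ego-feature embedding
z = np.concatenate([h_adj, h_ego], axis=1) # combined node representation
print(z.shape)  # (5, 8)
```

The adjacency branch lets the model pick up monophilous signal (which neighbors a node has), while the ego branch carries the node's own attributes; a classifier head would act on `z`.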

