GLINKX: A SCALABLE UNIFIED FRAMEWORK FOR HOMOPHILOUS AND HETEROPHILOUS GRAPHS

Abstract

In graph learning, there have been two predominant inductive biases regarding graph-inspired architectures: On the one hand, higher-order interactions and message passing work well on homophilous graphs and are leveraged by GCNs and GATs. Such architectures, however, cannot easily scale to large real-world graphs. On the other hand, shallow (or node-level) models using ego features and adjacency embeddings work well on heterophilous graphs. In this work, we propose GLINKX, a novel scalable shallow method that works on both homophilous and heterophilous graphs. GLINKX leverages (i) novel monophilous label propagations, (ii) ego/node features, (iii) knowledge graph embeddings as positional embeddings, (iv) node-level training, and (v) low-dimensional message passing. Formally, we prove novel error bounds and justify the components of GLINKX. Experimentally, we show its effectiveness on several homophilous and heterophilous datasets.

1. INTRODUCTION

In recent years, graph learning methods have emerged with strong performance on various ML tasks. Graph ML methods leverage the topology of the graph underlying the data (Battaglia et al., 2018) to improve their performance. In the context of node classification, a central design consideration for graph ML architectures is whether the data is homophilous or heterophilous. For homophilous data, where neighboring nodes share similar labels (McPherson et al., 2001; Altenburger & Ugander, 2018a), Graph Neural Network (GNN)-based methods are able to achieve high accuracy. Specifically, a broad subclass of successful GNNs are Graph Convolutional Networks (GCNs) and related architectures (e.g., GCN, GAT, etc.) (Kipf & Welling, 2016; Veličković et al., 2017; Zhu et al., 2020). In the GCN paradigm, message passing and higher-order interactions help node classification tasks in the homophilous setting, since such inductive biases tend to bring the learned representations of linked nodes close to each other. However, GCN-based architectures suffer from scalability issues. Performing (higher-order) propagations during the training stage is hard to scale to large graphs, because the number of nodes in the receptive field grows exponentially with the depth of the filter. Thus, for practical purposes, GCN-based methods require neighbor sampling, which substantially increases their training time. For this reason, architectures (Huang et al., 2020; Zhang et al., 2022b; Sun et al., 2021; Maurya et al., 2021; Rossi et al., 2020) that perform propagations outside of the training loop (as a preprocessing step) have shown promising results in terms of scaling to large graphs. In heterophilous datasets (Rogers et al., 2014), connected nodes tend to have different labels. Many works that address heterophily fall into two categories with respect to scale.
On the one hand, recent architectures that are successful in terms of accuracy (Jin et al., 2022a; Di Giovanni et al., 2022; Zheng et al., 2022b; Luan et al., 2021; Chien et al., 2020; Lei et al., 2022) resemble GCNs in terms of design and thus suffer from the same scalability issues. On the other hand, shallow or node-level models (see, e.g., (Lim et al., 2021; Zhong et al., 2022)), i.e., models that treat graph data as tabular data and do not involve propagations during training, have shown a lot of promise on large heterophilous graphs. In (Lim et al., 2021), it is shown that combining ego embeddings (node features) and adjacency embeddings works well in the heterophilous setting. One element that LINKX exploits via the adjacency embeddings is monophily (Altenburger & Ugander, 2018a;b), namely the similarity of the labels of a node's neighbors. However, this design remains impractical for real-world data, since LINKX is not inductive (see Section 2) and embedding the adjacency matrix directly requires many parameters. In LINKX, the adjacency embedding of a node can alternatively be viewed as a positional embedding (PE) of the node in the graph, and recent developments (Kim et al., 2022; Dwivedi et al., 2021; Lim et al., 2021) have shown the importance of PEs in both homophilous and heterophilous settings. However, most of these works suggest PE parametrizations that are difficult to compute at large scale. We argue that PEs can be obtained in a scalable manner by utilizing knowledge graph embeddings, which prior work (El-Kishky et al., 2022; Lerer et al., 2019; Bordes et al., 2013; Yang et al., 2014) has shown can be trained on very large networks.
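To make the notion of monophily concrete, below is a hypothetical sketch (our own simplification for illustration, not GLINKX's actual propagation operator) of aggregating training labels over 2-hop paths: even when a node's direct neighbors carry different labels than the node itself, its neighbors' neighbors often share its label, so a 2-hop label aggregate can be informative. The function name and the toy graph are ours.

```python
import numpy as np

def two_hop_label_aggregate(A, Y_train, train_mask):
    """Aggregate one-hot training labels over 2-hop walks and normalize per node.

    A: (n, n) adjacency matrix; Y_train: (n, c) one-hot labels;
    train_mask: (n,) 1.0 for training nodes, 0.0 otherwise.
    """
    Y = Y_train * train_mask[:, None]      # zero out non-training labels
    agg = A @ (A @ Y)                      # label counts over 2-hop walks
    norm = agg.sum(axis=1, keepdims=True)
    # Nodes with no labeled 2-hop neighbors get an all-zero row.
    return np.divide(agg, norm, out=np.zeros_like(agg), where=norm > 0)

# Toy example: a path 0 - 1 - 2; nodes 0 and 2 are labeled, node 1 is not.
A = np.array([[0., 1., 0.],
              [1., 0., 1.],
              [0., 1., 0.]])
Y_train = np.array([[1., 0.],
                    [0., 0.],
                    [0., 1.]])
train_mask = np.array([1., 0., 1.])
P = two_hop_label_aggregate(A, Y_train, train_mask)
print(P[0])  # node 0 sees labels of its 2-hop neighborhood: [0.5, 0.5]
```

Note that the two sparse matrix-vector products can be computed once, before training, in line with the preprocessing-based methods discussed above.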
Goal & Contribution: In this work, we develop a scalable method for node classification that: (i) works both on homophilous and heterophilous graphs, (ii) is simpler and faster than conventional message-passing networks (by avoiding the neighbor-sampling and message-passing overhead during training), and (iii) can work in both a transductive and an inductive¹ setting. For a method to be scalable, we argue that it should: (i) run models at the node level (thus leveraging i.i.d. minibatching), (ii) avoid message passing during training and perform it only a constant number of times before training, and (iii) transmit small messages along the edges. Our proposed method, GLINKX (see Section 3), combines all of the above desiderata. GLINKX has three components: (i) ego embeddings², (ii) PEs inspired by architectures suited for heterophilous settings, and (iii) scalable 2nd-hop-neighborhood propagations inspired by architectures suited for monophilous settings (Section 2.5). We prove novel error bounds that justify the components of our method (Section 3.4). Finally, we evaluate GLINKX's empirical effectiveness on several homophilous and heterophilous datasets (Section 4).
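The second desideratum, performing propagations a constant number of times before training, can be illustrated with a minimal sketch in the style of the preprocessing-based methods cited above (e.g., Rossi et al., 2020). This is our own toy example, not GLINKX's pipeline; the function names and shapes are illustrative. Diffused features are computed once with matrix products, after which each node is an i.i.d. row of tabular features:

```python
import numpy as np

def sym_norm_adj(A):
    """Symmetrically normalize an adjacency matrix with self-loops: D^{-1/2}(A+I)D^{-1/2}."""
    A = A + np.eye(A.shape[0])
    d_inv_sqrt = 1.0 / np.sqrt(A.sum(axis=1))
    return A * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]

def precompute_propagations(A, X, K=2):
    """Return [X | A_sym X | ... | A_sym^K X], computed once before training."""
    A_sym = sym_norm_adj(A)
    feats = [X]
    for _ in range(K):
        feats.append(A_sym @ feats[-1])
    return np.concatenate(feats, axis=1)

# Toy graph: a path on 3 nodes, with one-hot node features.
A = np.array([[0., 1., 0.],
              [1., 0., 1.],
              [0., 1., 0.]])
X = np.eye(3)
Z = precompute_propagations(A, X, K=2)
print(Z.shape)  # (3, 9): original features plus two hops of diffusion
```

After this step, any node-level model (an MLP, gradient-boosted trees, etc.) can be trained on `Z` with standard i.i.d. minibatching, with no message passing inside the training loop.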

2. PRELIMINARIES

2.1. NOTATION

We denote scalars with lower-case letters, vectors with bold lower-case letters, and matrices with bold upper-case letters. We consider a directed graph G = G(V, E) with vertex set V, |V| = n nodes, edge set E, |E| = m edges, and adjacency matrix A. X ∈ ℝ^{n×d_X} represents the d_X-dimensional feature matrix and P ∈ ℝ^{n×d_P} represents the d_P-dimensional PE matrix. A node i has a feature vector x_i ∈ ℝ^{d_X} and a positional embedding p_i ∈ ℝ^{d_P}, and belongs to a class y_i ∈ {1, …, c}. The training set is denoted by G_train(V_train, E_train), the validation set by G_valid(V_valid, E_valid), and the test set by G_test(V_test, E_test). I{·} is the indicator function. Δ^c is the c-dimensional simplex.

2.2. GRAPH CONVOLUTIONAL NEURAL NETWORKS

In homophilous datasets, GCN-based methods have been used for node classification. GCNs (Kipf & Welling, 2016) utilize feature propagations together with non-linearities to produce node embeddings. A GCN consists of multiple layers, where layer i collects i-th-hop information from the nodes through propagations and forwards this information to layer i+1. More specifically, if G has a symmetrically-normalized adjacency matrix A′_sym (with self-loops, ignoring the directionality of edges), then a GCN has the form

H^(0) = X,  H^(i+1) = σ(A′_sym H^(i) W^(i)) for all i ∈ [L],  Y = softmax(H^(L)).

Here H^(i) is the embedding produced by the previous layer, W^(i) is a learnable projection matrix, and σ(·) is a non-linearity (e.g., ReLU or sigmoid).
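The recursion above can be sketched in a few lines of numpy. This is a toy forward pass with random weights (all shapes and names are ours, for illustration only), using a trivially normalized adjacency so the arithmetic is easy to follow:

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def gcn_forward(A_sym, X, weights):
    """H^(0) = X; H^(i+1) = relu(A_sym H^(i) W^(i)); Y = softmax(H^(L))."""
    H = X
    for i, W in enumerate(weights):
        H = A_sym @ H @ W
        if i < len(weights) - 1:      # no non-linearity before the final softmax
            H = relu(H)
    e = np.exp(H - H.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)  # row-wise softmax

rng = np.random.default_rng(0)
A_sym = np.eye(4)                     # placeholder normalized adjacency
X = rng.normal(size=(4, 5))           # 4 nodes, 5 features
weights = [rng.normal(size=(5, 8)), rng.normal(size=(8, 3))]  # 2-layer GCN, 3 classes
Y = gcn_forward(A_sym, X, weights)
print(Y.shape)  # (4, 3); each row is a probability distribution over classes
```

Note that each layer multiplies by A_sym, which is exactly the per-layer propagation that makes minibatched training expensive: sampling an L-layer receptive field touches an exponentially growing neighborhood.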

2.3. LINKX

In heterophilous datasets, the simple LINKX method has been shown to perform well. LINKX combines two components, an MLP on the node features X and LINK regression (Altenburger & Ugander, 2018a) on the adjacency matrix, as follows:

h_A = MLP_A(A),  h_X = MLP_X(X),  Y = MLP_f(σ(W [h_A ∥ h_X] + h_A + h_X)),

where ∥ denotes concatenation and σ(·) is a non-linearity.
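A minimal sketch of the LINKX forward pass described above (our own simplification of Lim et al. (2021), with single-hidden-layer MLPs and random weights; all names and shapes are illustrative):

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def mlp(H, W1, W2):
    """A single-hidden-layer MLP: relu(H W1) W2."""
    return relu(H @ W1) @ W2

def linkx_forward(A, X, params):
    """hA = MLP_A(A); hX = MLP_X(X); Y = MLP_f(relu(W [hA || hX] + hA + hX))."""
    hA = mlp(A, *params["A"])            # LINK regression branch on adjacency rows
    hX = mlp(X, *params["X"])            # MLP branch on node features
    h = relu(np.concatenate([hA, hX], axis=1) @ params["W"] + hA + hX)
    return mlp(h, *params["f"])          # final classifier head (logits)

rng = np.random.default_rng(0)
n, dX, dh, c = 6, 4, 8, 3                # nodes, feature dim, hidden dim, classes
A = (rng.random((n, n)) < 0.3).astype(float)
X = rng.normal(size=(n, dX))
params = {
    "A": (rng.normal(size=(n, dh)), rng.normal(size=(dh, dh))),
    "X": (rng.normal(size=(dX, dh)), rng.normal(size=(dh, dh))),
    "W": rng.normal(size=(2 * dh, dh)),
    "f": (rng.normal(size=(dh, dh)), rng.normal(size=(dh, c))),
}
logits = linkx_forward(A, X, params)
print(logits.shape)  # (6, 3)
```

The adjacency branch makes the parameter-count issue visible: the first weight matrix of MLP_A has n rows, so it grows linearly with the number of nodes, which is one of the scalability concerns raised in the introduction.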
¹ For this paper, we operate in the transductive setting; see App. B for the inductive setting.
² We use "ego embeddings" and "node features" interchangeably.

