PANREP: UNIVERSAL NODE EMBEDDINGS FOR HETEROGENEOUS GRAPHS

Abstract

Learning unsupervised node embeddings facilitates several downstream tasks such as node classification and link prediction. A node embedding is universal if it is designed to be used by and benefit various downstream tasks. This work introduces PanRep, a graph neural network (GNN) model for unsupervised learning of universal node representations for heterogeneous graphs. PanRep consists of a GNN encoder that obtains node embeddings and four decoders, each capturing a different topological or node-feature property. By modeling these properties, the novel unsupervised framework learns universal embeddings applicable to different downstream tasks. PanRep can be further fine-tuned to account for possibly limited labels. In this operational setting, PanRep serves as a pretrained model for extracting node embeddings from heterogeneous graph data. PanRep outperforms all unsupervised and certain supervised methods in node classification and link prediction, especially when the labeled data for the supervised methods are limited. PanRep-FT (with fine-tuning) outperforms all other supervised approaches, which corroborates the merits of pretraining models. Finally, we apply PanRep-FT to discovering novel drugs for Covid-19. We showcase the advantage of universal embeddings in drug repurposing and identify several drugs used in clinical trials as possible drug candidates.

1. INTRODUCTION

Learning node representations from heterogeneous graph data powers the success of many downstream machine learning tasks such as node classification (Kipf & Welling, 2017) and link prediction (Wang et al., 2017). Graph neural networks (GNNs) learn node embeddings by applying a sequence of nonlinear operations parametrized by the graph adjacency matrix and achieve state-of-the-art performance in the aforementioned downstream tasks. The era of big data provides an opportunity for machine learning methods to harness large datasets (Wu et al., 2013). Nevertheless, labels in these datasets are typically scarce due to either lack of information or increased labeling costs (Bengio et al., 2012). The lack of labeled data hinders the performance of supervised algorithms, which may not generalize well to unseen data, and motivates unsupervised learning. Unsupervised node embeddings may be used for downstream learning tasks, while the specific tasks are typically not known a priori. For example, node representations of the Amazon book graph can be employed for recommending new books as well as classifying a book's genre. This work aspires to provide universal node embeddings that can be applied to multiple downstream tasks and achieve performance comparable to their supervised counterparts. Although unsupervised learning has numerous applications, limited labels for the downstream task may be available. Refining the unsupervised universal representations with these labels could further increase the representation power of the embeddings. This can be achieved by fine-tuning the unsupervised model. Natural language processing methods have achieved state-of-the-art performance by applying such a fine-tuning framework (Devlin et al., 2018).
Fine-tuning pretrained models is beneficial compared to end-to-end supervised learning: the former typically generalizes better, especially when labeled data are limited, and decreases the inference time since just a few fine-tuning iterations typically suffice for the model to converge (Erhan et al., 2010). This work introduces a framework for unsupervised learning of universal node representations on heterogeneous graphs termed PanRep¹. It consists of a GNN encoder that maps the heterogeneous graph data to node embeddings and four decoders, each capturing different topological and node-feature properties. PanRep can be further fine-tuned to account for possibly limited labels of the downstream task; we term the resulting model PanRep-FT. PanRep can thus be considered a pretrained model for extracting node embeddings from heterogeneous graph data. Figure 1 illustrates the two novel models. The contribution of this work is threefold.

C1. We introduce a novel problem formulation of universal unsupervised learning and design a tailored learning framework termed PanRep. We identify the following general properties of heterogeneous graph data: (i) the clustering of local node features, (ii) the structural similarity among nodes, (iii) the local and intermediate neighborhood structure, and (iv) the mutual information among same-type nodes. We develop four novel decoders to model these properties.

C2. We adjust the unsupervised universal learning framework to account for possibly limited labels of the downstream task. PanRep-FT refines the universal embeddings and increases the model's generalization capability.

C3. We compare the proposed models to state-of-the-art supervised and unsupervised methods for node classification and link prediction. PanRep outperforms all unsupervised and certain supervised methods in node classification, especially when the labeled data for the supervised methods are limited. PanRep-FT outperforms even supervised approaches in node classification and link prediction, which corroborates the merits of pretraining models.
Finally, we apply our method to the drug-repurposing knowledge graph (DRKG) for discovering drugs for Covid-19 and identify several drugs used in clinical trials as possible drug candidates.
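To make the encoder-plus-four-decoders framework concrete, the following is a minimal NumPy sketch of the idea: a one-layer relational GNN encoder produces embeddings H, and four simplified stand-in losses mirror the four properties listed above (feature clustering, structural similarity, neighborhood structure via link reconstruction, and DGI-style mutual information). The toy graph, relation names, loss forms, and all sizes are illustrative assumptions, not the paper's actual architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy heterogeneous graph: N nodes, one adjacency matrix per relation type.
N, F, D = 6, 8, 4
X = rng.normal(size=(N, F))                       # input node features
A = {r: (rng.random((N, N)) < 0.4).astype(float)  # per-relation adjacency
     for r in ("writes", "cites")}

# Relational GNN encoder: one mean-aggregation layer per relation.
W = {r: rng.normal(scale=0.3, size=(F, D)) for r in A}
W_self = rng.normal(scale=0.3, size=(F, D))

def encode(X):
    H = X @ W_self
    for r, Ar in A.items():
        deg = Ar.sum(1, keepdims=True) + 1e-9
        H = H + (Ar / deg) @ X @ W[r]             # mean over r-neighbors
    return np.tanh(H)

def softmax(z):
    e = np.exp(z - z.max(1, keepdims=True))
    return e / e.sum(1, keepdims=True)

H = encode(X)                                     # embeddings, shape (N, D)

# (i) Cluster decoder: predict the feature-space cluster of each node.
centroids = X[:2]                                 # 2 toy clusters
labels = np.argmin(((X[:, None] - centroids) ** 2).sum(-1), 1)
P = softmax(H @ rng.normal(scale=0.3, size=(D, 2)))
loss_cluster = -np.log(P[np.arange(N), labels] + 1e-9).mean()

# (ii) Structural-similarity decoder: regress simple structural
# statistics (here, per-relation out-degree) from the embeddings.
stats = np.stack([Ar.sum(1) for Ar in A.values()], 1)
loss_struct = ((H @ rng.normal(scale=0.3, size=(D, 2)) - stats) ** 2).mean()

# (iii) Link-reconstruction decoder: score node pairs against edges.
S = 1.0 / (1.0 + np.exp(-H @ H.T))                # edge probabilities
target = np.clip(sum(A.values()), 0, 1)
loss_link = -(target * np.log(S + 1e-9)
              + (1 - target) * np.log(1 - S + 1e-9)).mean()

# (iv) Mutual-information decoder (DGI-style): discriminate real node
# embeddings from those of a feature-shuffled "corrupted" graph.
s = H.mean(0)                                     # graph-level summary
H_fake = encode(X[rng.permutation(N)])
d_real = 1.0 / (1.0 + np.exp(-(H @ s)))
d_fake = 1.0 / (1.0 + np.exp(-(H_fake @ s)))
loss_mi = -(np.log(d_real + 1e-9).mean() + np.log(1 - d_fake + 1e-9).mean())

# Training would minimize the sum of the decoder losses over the
# encoder and decoder parameters; fine-tuning (PanRep-FT) would add a
# supervised head on H and continue from the pretrained weights.
total_loss = loss_cluster + loss_struct + loss_link + loss_mi
```

Because all four decoders share the single encoder, gradients from every loss shape the same embeddings H, which is what pushes them toward being useful across tasks rather than specialized to one.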

2. RELATED WORK

Unsupervised learning. Representation learning amounts to mapping nodes into an embedding space where the graph's topological information and structure are preserved (Hamilton et al., 2017). Typically, representation learning methods follow the encoder-decoder framework advocated by PanRep. Nevertheless, the decoder is typically attuned to a single task based on, e.g., matrix factorization (Tang et al., 2015; Ahmed et al., 2013; Cao et al., 2015; Ou et al., 2016), random walks (Grover & Leskovec, 2016; Perozzi et al., 2014), or kernels on graphs (Smola & Kondor, 2003). Recently, methods relying on GNNs have become increasingly popular for representation learning tasks (Wu et al., 2020). GNNs typically rely on random walk-based objectives (Grover & Leskovec, 2016; Hamilton et al., 2017) or on maximizing the mutual information among node representations (Veličković et al., 2018b). Relational GNN methods extend representation learning to heterogeneous graphs (Dong et al., 2017; Shi et al., 2018; Shang et al., 2016). Relative to these contemporary works, PanRep introduces multiple decoders to learn universal embeddings for heterogeneous graph data, capturing the clustering of local node features, the structural similarity among nodes, the local and intermediate neighborhood structure, and the mutual information among same-type nodes.



¹ Pan: Pangkosmios (Greek for universal) and Rep: Representation.



Figure 1: Illustration of the PanRep (left) and PanRep-FT (right) models. The GNN encoder processes the node features X to obtain the embeddings H. The decoders facilitate unsupervised learning of H.

