PANREP: UNIVERSAL NODE EMBEDDINGS FOR HETEROGENEOUS GRAPHS

Abstract

Learning unsupervised node embeddings facilitates several downstream tasks such as node classification and link prediction. A node embedding is universal if it is designed to be used by, and to benefit, various downstream tasks. This work introduces PanRep, a graph neural network (GNN) model for unsupervised learning of universal node representations for heterogeneous graphs. PanRep consists of a GNN encoder that obtains node embeddings and four decoders, each capturing a different topological or node-feature property. By capturing these properties, the novel unsupervised framework learns universal embeddings applicable to different downstream tasks. PanRep can be further fine-tuned to account for possibly limited labels. In this operational setting, PanRep serves as a pretrained model for extracting node embeddings of heterogeneous graph data. PanRep outperforms all unsupervised and certain supervised methods in node classification and link prediction, especially when the labeled data for the supervised methods are scarce. PanRep-FT (with fine-tuning) outperforms all other supervised approaches, which corroborates the merits of pretraining models. Finally, we apply PanRep-FT to discovering novel drugs for Covid-19. We showcase the advantage of universal embeddings in drug repurposing and identify several drugs used in clinical trials as possible drug candidates.

1. INTRODUCTION

Learning node representations from heterogeneous graph data powers the success of many downstream machine learning tasks such as node classification (Kipf & Welling, 2017) and link prediction (Wang et al., 2017). Graph neural networks (GNNs) learn node embeddings by applying a sequence of nonlinear operations parametrized by the graph adjacency matrix, and achieve state-of-the-art performance in the aforementioned downstream tasks. The era of big data provides an opportunity for machine learning methods to harness large datasets (Wu et al., 2013). Nevertheless, labels in these datasets are typically scarce, due either to lack of information or to increased labeling costs (Bengio et al., 2012). The scarcity of labeled data hinders the performance of supervised algorithms, which may not generalize well to unseen data, and motivates unsupervised learning. Unsupervised node embeddings may be used for downstream learning tasks, even though the specific tasks are typically not known a priori. For example, node representations of the Amazon book graph can be employed for recommending new books as well as for classifying a book's genre. This work aspires to provide universal node embeddings that can be applied to multiple downstream tasks and achieve performance comparable to their supervised counterparts. Although unsupervised learning has numerous applications, limited labels for the downstream task may be available. Refining the unsupervised universal representations with these labels can further increase the representational power of the embeddings; this can be achieved by fine-tuning the unsupervised model. Natural language processing methods have achieved state-of-the-art performance by applying such a fine-tuning framework (Devlin et al., 2018).
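To make the phrase "nonlinear operations parametrized by the graph adjacency matrix" concrete, a single graph-convolution layer in the style of Kipf & Welling (2017) can be sketched in NumPy. The function name `gcn_layer` and the toy graph are illustrative only; this is a minimal sketch of a generic GNN layer, not PanRep's encoder.

```python
import numpy as np

def gcn_layer(A, H, W):
    """One graph-convolution layer: H' = ReLU(A_hat @ H @ W),
    where A_hat is the symmetrically normalized adjacency matrix
    with self-loops (Kipf & Welling, 2017)."""
    A_tilde = A + np.eye(A.shape[0])           # add self-loops
    d = A_tilde.sum(axis=1)                    # node degrees
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))     # D^{-1/2}
    A_hat = D_inv_sqrt @ A_tilde @ D_inv_sqrt  # normalized adjacency
    return np.maximum(0.0, A_hat @ H @ W)      # ReLU nonlinearity

# Toy 3-node path graph with 2-dim node features (values are
# fixed for illustration; in practice W is learned).
A = np.array([[0., 1., 0.],
              [1., 0., 1.],
              [0., 1., 0.]])
H = np.eye(3, 2)            # initial node features
W = 0.5 * np.ones((2, 2))   # layer weights
emb = gcn_layer(A, H, W)    # one layer of node embeddings
print(emb.shape)            # (3, 2): one embedding per node
```

Stacking several such layers, each mixing information from one-hop neighborhoods, yields the multi-layer GNN encoders referred to above.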
Fine-tuning pretrained models is beneficial compared to end-to-end supervised learning: the former typically generalizes better, especially when labeled data are limited, and it decreases training time, since just a few fine-tuning iterations typically suffice for the model to converge (Erhan et al., 2010). This work introduces a framework for unsupervised learning of universal node representations on heterogeneous graphs, termed PanRep^1. It consists of a GNN encoder that maps the heterogeneous graph data to node embeddings and four decoders, each capturing different topological and node feature



^1 Pan: Pangkosmios (Greek for universal) and Rep: Representation

