SELF-SUPERVISED GRAPH-LEVEL REPRESENTATION LEARNING WITH LOCAL AND GLOBAL STRUCTURE

Anonymous authors

Abstract

This paper focuses on unsupervised/self-supervised whole-graph representation learning, which is critical in many tasks including drug and material discovery. Existing methods can effectively model the local structure between different graph instances, but they fail to discover the global semantic structure of the entire dataset. In this work, we propose a unified framework called Local-instance and Global-semantic Learning (GraphLoG) for self-supervised whole-graph representation learning. Besides preserving the local instance-level structure, GraphLoG leverages a nonparametric strategy to learn hierarchical prototypes of the data. These prototypes capture the semantic clusters in the latent space, and their number automatically adapts to different feature distributions. We evaluate GraphLoG by pre-training it on a large set of unlabeled graphs and then fine-tuning it on downstream tasks. Extensive experiments on both chemical and biological benchmark datasets demonstrate the effectiveness of our approach.

1. INTRODUCTION

Learning informative representations of whole graphs is a fundamental problem in a variety of domains and tasks, such as molecular property prediction in drug and material discovery (Gilmer et al., 2017; Wu et al., 2018), protein function prediction in biological networks (Alvarez & Yan, 2012; Jiang et al., 2017), and circuit property prediction in circuit design (Zhang et al., 2019). Recently, Graph Neural Networks (GNNs) have attracted a surge of interest and shown their effectiveness in learning graph representations. These methods are usually trained in a supervised fashion, which requires a large amount of labeled data. Nevertheless, in many scientific domains, labeled data are very limited and expensive to obtain. It is therefore becoming increasingly important to learn graph representations in an unsupervised or self-supervised fashion.

Self-supervised learning has recently achieved profound success in both natural language processing, e.g., GPT (Radford et al., 2018) and BERT (Devlin et al., 2019), and image understanding, e.g., MoCo (He et al., 2019) and SimCLR (Chen et al., 2020). However, how to effectively learn graph representations in a self-supervised way remains an open problem. Intuitively, a desirable graph representation should preserve the local-instance structure, so that similar graphs are embedded close to each other and dissimilar ones stay far apart. In addition, the representations of a set of graphs should also reflect the global-semantic structure of the data, so that graphs with similar semantic properties are compactly embedded, which benefits various downstream tasks, e.g., graph classification or regression. Such structure can be sufficiently captured by semantic clusters (Caron et al., 2018; Ji et al., 2019), especially in a hierarchical fashion (Li et al., 2020).
There are several recent works that learn graph representations in a self-supervised manner, such as local-global mutual information maximization (Velickovic et al., 2019; Sun et al., 2019), structural-similarity/context prediction (Navarin et al., 2018; Hu et al., 2019; You et al., 2020) and contrastive multi-view learning (Hassani & Ahmadi, 2020). However, all these methods model only the local structure between different graph instances and fail to discover the global-semantic structure. To address this shortcoming, we seek an approach that can model both the local and global structure of a given set of graphs.

To attain this goal, we propose a Local-instance and Global-semantic Learning (GraphLoG) framework for self-supervised graph representation learning. Specifically, to preserve the local similarity between graph instances, we first align the embeddings of correlated graphs/subgraphs¹ by maximizing their mutual information. In this locally smooth embedding space, we further represent the distribution of different graph embeddings with hierarchical prototypes², whose number is adaptively determined by the data in a nonparametric fashion. During training, these prototypes guide each graph to map to its semantically similar feature cluster, and, simultaneously, the prototypes are maintained by online-updated graph embeddings. In this process, the global-semantic structure of the data is gradually discovered and refined. The whole model is pre-trained on a large number of unlabeled graphs, and then fine-tuned and evaluated on downstream tasks.

We summarize our contributions as follows:

• We contribute a unified framework called Local-instance and Global-semantic Learning (GraphLoG) for self-supervised graph representation learning, which is able to model the structure of a set of graphs both locally and globally.
• We propose to infer the global-semantic structure underlying the unlabeled graphs by learning hierarchical prototypes via a nonparametric strategy.

• We empirically verify our framework's superior performance on different GNN architectures through pre-training on a large-scale unlabeled dataset and fine-tuning on benchmark tasks in both the chemistry and biology domains.
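As a rough illustration of the local-instance objective sketched above, the mutual information between embeddings of correlated graph pairs can be lower-bounded with an InfoNCE-style contrastive loss, where the other pairs in a batch serve as negatives. The NumPy arrays below stand in for learned GNN embeddings, and the function name and temperature value are our own illustrative choices, not the paper's exact formulation:

```python
import numpy as np

def info_nce_loss(z, z_pos, temperature=0.5):
    """InfoNCE-style lower bound on the mutual information between
    embeddings of correlated graph pairs (z[i], z_pos[i]); the
    remaining pairs in the batch act as negatives."""
    # L2-normalize so dot products are cosine similarities
    z = z / np.linalg.norm(z, axis=1, keepdims=True)
    z_pos = z_pos / np.linalg.norm(z_pos, axis=1, keepdims=True)
    logits = z @ z_pos.T / temperature           # (B, B) similarity matrix
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # positives (correlated pairs) sit on the diagonal
    return -np.mean(np.diag(log_prob))

rng = np.random.default_rng(0)
z = rng.normal(size=(8, 16))                 # stand-in graph embeddings
z_aug = z + 0.05 * rng.normal(size=(8, 16))  # correlated "views"
loss = info_nce_loss(z, z_aug)
```

Minimizing this loss pulls correlated graphs/subgraphs together while pushing apart the unrelated ones in the batch, which is one standard way to instantiate the "similar graphs close, dissimilar graphs far apart" intuition.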

2. PROBLEM DEFINITION AND PRELIMINARIES

2.1 PROBLEM DEFINITION

An ideal representation should preserve the local structure among the data instances. More specifically, we define it as follows:

Definition 1 (Local-instance Structure). The local-instance structure refers to the local pairwise similarity between different instances (Roweis & Saul, 2000; Belkin & Niyogi, 2002). To preserve the local-instance structure of graph-structured data, a pair of similar graphs/subgraphs, G and G′, are expected to be mapped to nearby positions in the embedding space, as illustrated in Fig. 1(a), while dissimilar pairs should be mapped far apart.

The pursuit of local-instance structure alone is usually insufficient to capture the semantics underlying the entire dataset. It is therefore important to also discover the global-semantic structure of the data, which is concretely defined as follows:

Definition 2 (Global-semantic Structure). A real-world dataset is usually distributed as different semantic clusters (Furnas et al., 2017; Ji et al., 2019). We therefore define the global-semantic structure of a dataset as the distribution of its semantic clusters, where each cluster is represented by a prototype (i.e., a representative cluster embedding). Since the semantics of a set of graphs can be structured in a hierarchical way (Ashburner et al., 2000; Chen et al., 2012), we represent the whole dataset with hierarchical prototypes. A detailed example can be seen in Fig. 1(b).
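To make the notion of hierarchical prototypes concrete, the sketch below assigns each graph embedding to its nearest bottom-level prototype, assigns those prototypes in turn to coarser top-level prototypes, and refreshes prototypes with an exponential moving average of their assigned embeddings. This is only a simplified two-level toy (the cluster counts, momentum value, and random stand-in embeddings are ours), not the paper's nonparametric procedure:

```python
import numpy as np

def assign_to_prototypes(embeddings, prototypes):
    """Return the index of the nearest prototype for each embedding."""
    dists = np.linalg.norm(
        embeddings[:, None, :] - prototypes[None, :, :], axis=2)
    return dists.argmin(axis=1)

def ema_update(prototypes, embeddings, assign, momentum=0.9):
    """Refresh each prototype with an exponential moving average of the
    embeddings currently assigned to its cluster."""
    new_protos = prototypes.copy()
    for k in range(len(prototypes)):
        members = embeddings[assign == k]
        if len(members) > 0:
            new_protos[k] = (momentum * prototypes[k]
                             + (1 - momentum) * members.mean(axis=0))
    return new_protos

rng = np.random.default_rng(1)
graph_emb = rng.normal(size=(32, 8))          # stand-in for GNN outputs
level1 = rng.normal(size=(6, 8))              # fine-grained prototypes
level2 = rng.normal(size=(2, 8))              # coarse prototypes
a1 = assign_to_prototypes(graph_emb, level1)  # graphs -> level-1 clusters
a2 = assign_to_prototypes(level1, level2)     # level-1 -> level-2 clusters
level1 = ema_update(level1, graph_emb, a1)
```

The two assignment maps together form a tree over the dataset: graphs hang off fine-grained prototypes, which in turn hang off coarser ones, matching the hierarchical structure in Definition 2.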



¹ In our method, we obtain correlated graphs/subgraphs via minor modification on node/edge attributes.
² Hierarchical prototypes are representative cluster embeddings organized in a hierarchical way.
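One simple way to realize "minor modification on node/edge attributes" is to randomly mask a small fraction of node attribute vectors while leaving the topology untouched, as in attribute-masking augmentations. The function below is a hedged sketch of that idea with dense NumPy feature matrices, not the paper's exact perturbation scheme:

```python
import numpy as np

def perturb_attributes(node_feats, mask_rate=0.15, rng=None):
    """Produce a correlated view of a graph by zeroing out a small
    random fraction of node attribute vectors; edges are unchanged."""
    rng = rng if rng is not None else np.random.default_rng()
    feats = node_feats.copy()
    mask = rng.random(len(feats)) < mask_rate  # which nodes to mask
    feats[mask] = 0.0
    return feats, mask

rng = np.random.default_rng(0)
x = rng.normal(size=(10, 4)) + 1.0             # toy node attribute matrix
x_view, mask = perturb_attributes(x, mask_rate=0.3, rng=rng)
```

The original graph and its perturbed view form one correlated pair whose embeddings are then aligned by the mutual-information objective.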



Figure 1: Illustration of GraphLoG. (a) Correlated graphs are constrained to be adjacently embedded to pursue the local-instance structure of the data. (b) Hierarchical prototypes are employed to discover and refine the global-semantic structure of the data.

