SELF-SUPERVISED GRAPH-LEVEL REPRESENTATION LEARNING WITH LOCAL AND GLOBAL STRUCTURE

Anonymous

Abstract

This paper focuses on unsupervised/self-supervised whole-graph representation learning, which is critical in many tasks including drug and material discovery. Current methods can effectively model the local structure between different graph instances, but they fail to discover the global semantic structure of the entire dataset. In this work, we propose a unified framework called Local-instance and Global-semantic Learning (GraphLoG) for self-supervised whole-graph representation learning. Specifically, besides preserving the local instance-level structure, GraphLoG leverages a nonparametric strategy to learn hierarchical prototypes of the data. These prototypes capture the semantic clusters in the latent space, and the number of prototypes can automatically adapt to different feature distributions. We evaluate GraphLoG by pre-training it on massive unlabeled graphs followed by fine-tuning on downstream tasks. Extensive experiments on both chemical and biological benchmark datasets demonstrate the effectiveness of our approach.

1. INTRODUCTION

Learning informative representations of whole graphs is a fundamental problem in a variety of domains and tasks, such as molecular property prediction in drug and material discovery (Gilmer et al., 2017; Wu et al., 2018), protein function prediction in biological networks (Alvarez & Yan, 2012; Jiang et al., 2017), and circuit property prediction in circuit design (Zhang et al., 2019). Recently, Graph Neural Networks (GNNs) have attracted a surge of interest and shown their effectiveness in learning graph representations. These methods are usually trained in a supervised fashion, which requires a large amount of labeled data. Nevertheless, in many scientific domains, labeled data are very limited and expensive to obtain. Therefore, it is becoming increasingly important to learn graph representations in an unsupervised or self-supervised fashion.

Self-supervised learning has recently achieved profound success in both natural language processing, e.g. GPT (Radford et al., 2018) and BERT (Devlin et al., 2019), and image understanding, e.g. MoCo (He et al., 2019) and SimCLR (Chen et al., 2020). However, how to effectively learn graph representations in a self-supervised way remains an open problem. Intuitively, a desirable graph representation should preserve the local-instance structure, so that similar graphs are embedded close to each other and dissimilar ones stay far apart. In addition, the representations of a set of graphs should also reflect the global-semantic structure of the data, so that graphs with similar semantic properties are compactly embedded, which benefits various downstream tasks, e.g. graph classification or regression. Such structure can be sufficiently captured by semantic clusters (Caron et al., 2018; Ji et al., 2019), especially in a hierarchical fashion (Li et al., 2020).
Some recent works learn graph representations in a self-supervised manner, e.g. via local-global mutual information maximization (Velickovic et al., 2019; Sun et al., 2019), structural-similarity/context prediction (Navarin et al., 2018; Hu et al., 2019; You et al., 2020) and contrastive multi-view learning (Hassani & Ahmadi, 2020). However, all these methods model only the local structure between different graph instances and fail to discover the global-semantic structure. To address this shortcoming, we seek an approach that can model both the local and global structure of a given set of graphs.

To attain this goal, we propose a Local-instance and Global-semantic Learning (GraphLoG) framework for self-supervised graph representation learning. Specifically, to preserve the local similarity between various graph instances, we first align the embeddings of correlated graphs/subgraphs by maximizing their mutual information. In this locally smooth embedding space, we further represent the distribution of graph embeddings with hierarchical prototypes, whose number is adaptively determined by the data in a nonparametric fashion. During training, these prototypes guide each graph to map to its semantically similar feature cluster, and, simultaneously, the prototypes are maintained by the online-updated graph embeddings. In this process, the global-semantic structure of the data is gradually discovered and refined. The whole model is pre-trained on a large number of unlabeled graphs and then fine-tuned and evaluated on downstream tasks.

We summarize our contributions as follows:

• We contribute a unified framework called Local-instance and Global-semantic Learning (GraphLoG) for self-supervised graph representation learning, which is able to model the structure of a set of graphs both locally and globally.
• We propose the novel strategy of learning hierarchical prototypes in a nonparametric fashion to infer the global-semantic structure underlying the unlabeled graphs.

• We empirically verify our framework's superior performance on different GNN architectures through pre-training on a large-scale unlabeled dataset and fine-tuning on benchmark tasks in both the chemistry and biology domains.

2.1. PROBLEM DEFINITION

An ideal representation should preserve the local structure among data instances. More specifically, we define it as follows:

Definition 1 (Local-instance Structure). The local-instance structure refers to the local pairwise similarity between different instances (Roweis & Saul, 2000; Belkin & Niyogi, 2002). To preserve the local-instance structure of graph-structured data, a pair of similar graphs/subgraphs, $G$ and $G'$, are expected to be mapped to nearby positions in the embedding space, as illustrated in Fig. 1(a), while dissimilar pairs should be mapped far apart.

The pursuit of local-instance structure alone is usually insufficient to capture the semantics underlying the entire dataset. It is therefore important to discover the global-semantic structure of the data, which is concretely defined as follows:

Definition 2 (Global-semantic Structure). A real-world dataset is usually distributed as different semantic clusters (Furnas et al., 2017; Ji et al., 2019). Therefore, we define the global-semantic structure of a dataset as the distribution of its semantic clusters, where each cluster is represented by a prototype (i.e. a representative cluster embedding). Since the semantics of a set of graphs can be structured in a hierarchical way (Ashburner et al., 2000; Chen et al., 2012), we represent the whole dataset with hierarchical prototypes. A detailed example can be seen in Fig. 1(b).

Problem Definition. For self-supervised graph representation learning, a set of unlabeled graphs $\mathcal{G} = \{G_1, G_2, \cdots, G_M\}$ is given, and we aim to learn a low-dimensional vector $h_{G_i} \in \mathbb{R}^{\delta}$ for each graph $G_i \in \mathcal{G}$ under the guidance of the data itself. Specifically, we expect the derived graph embeddings $H \in \mathbb{R}^{M \times \delta}$ to follow both the local-instance and global-semantic structure.

2.2. PRELIMINARIES

Graph Neural Networks (GNNs). Given a graph $G = (\mathcal{V}, \mathcal{E})$ with node attributes $X_\mathcal{V} = \{X_v \,|\, v \in \mathcal{V}\}$ and edge attributes $X_\mathcal{E} = \{X_{uv} \,|\, (u, v) \in \mathcal{E}\}$, a GNN aims to learn an embedding vector $h_v$ for each node $v \in \mathcal{V}$ and a vector $h_G$ for the entire graph $G$. For an $L$-layer GNN, a neighborhood aggregation scheme is performed to capture the $L$-hop information surrounding each node. The $l$-th layer of a GNN can be formalized as follows:

$$h_v^{(l)} = \mathrm{COMBINE}^{(l)}\Big(h_v^{(l-1)}, \mathrm{AGGREGATE}^{(l)}\big(\big\{(h_v^{(l-1)}, h_u^{(l-1)}, X_{uv}) : u \in \mathcal{N}(v)\big\}\big)\Big), \quad (1)$$

where $\mathcal{N}(v)$ is the neighborhood set of $v$, $h_v^{(l)}$ denotes the representation of node $v$ at the $l$-th layer, and $h_v^{(0)}$ is initialized as the node attribute $X_v$. Since $h_v$ summarizes the information of a patch centered around node $v$, we refer to $h_v$ as a patch embedding to underscore this point. The embedding of the entire graph is derived by a permutation-invariant readout function:

$$h_G = \mathrm{READOUT}\big(\{h_v \,|\, v \in \mathcal{V}\}\big). \quad (2)$$

Mutual Information Estimation. Mutual information (MI) measures both the linear and nonlinear dependency between two random variables. Some recent works (Belghazi et al., 2018; Hjelm et al., 2019) employ neural networks to estimate a lower bound of MI. Among them, minimizing the InfoNCE loss (van den Oord et al., 2018) maximizes a lower bound of MI, and we adopt it in this work for its simplicity and effectiveness. In practice, given a query $q$, the InfoNCE loss is optimized to score the positive sample $z^+$ higher than a set of distractors $\{z_i\}_{i=1}^K$:

$$\mathcal{L}_{\mathrm{NCE}}\big(q, z^+, \{z_i\}_{i=1}^K\big) = -\log \frac{\exp\big(T(q, z^+)\big)}{\exp\big(T(q, z^+)\big) + \sum_{i=1}^K \exp\big(T(q, z_i)\big)}, \quad (3)$$

where $T(\cdot, \cdot)$ is a parameterized discriminator function that maps two representation vectors to a scalar value; its architecture is detailed in Sec. 6.1.

Rival Penalized Competitive Learning (RPCL). The RPCL method (Xu et al., 1993) is a variant of classical competitive learning approaches, e.g. K-means clustering.
Concretely, given a sample for update, RPCL-based clustering not only pulls the winning cluster center (i.e. the closest one) towards the sample, but also pushes the rival cluster center (i.e. the second closest one) away from the sample. We adopt this clustering algorithm for its strong capability of discovering feature clusters without specifying the number of clusters beforehand (i.e. in a nonparametric fashion), which is critical in the self-supervised learning scenarios where the number of semantic categories is not given.
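As an illustration, the RPCL update described above can be sketched in a few lines of numpy; the initialization from data samples, learning rates and fixed epoch count here are simplified assumptions, not the paper's exact implementation.

```python
import numpy as np

def rpcl(samples, n_init=10, lr_win=0.05, lr_rival=0.005, epochs=20, seed=0):
    """Rival Penalized Competitive Learning (simplified sketch): for each
    sample, the winning (closest) center is pulled toward it, while the
    rival (second-closest) center is pushed away with a smaller rate, so
    superfluous centers drift off and are discarded afterwards."""
    rng = np.random.default_rng(seed)
    n_init = min(n_init, len(samples))
    centers = samples[rng.choice(len(samples), n_init, replace=False)].astype(float)
    for _ in range(epochs):
        for x in samples:
            d = np.linalg.norm(centers - x, axis=1)
            win, rival = np.argsort(d)[:2]
            centers[win] += lr_win * (x - centers[win])        # attract the winner
            centers[rival] -= lr_rival * (x - centers[rival])  # penalize the rival
    # keep only centers that win (i.e. are closest to) at least one sample
    wins = {int(np.argmin(np.linalg.norm(centers - x, axis=1))) for x in samples}
    return centers[sorted(wins)]
```

On data with well-separated clusters, the redundant initial centers are repelled and then filtered out, so the number of returned centers is determined by the data rather than fixed in advance, which is the nonparametric property exploited in Sec. 3.2.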

3.1. LEARNING LOCAL-INSTANCE STRUCTURE OF GRAPH REPRESENTATIONS

We first define the correlated graphs that are expected to be embedded close to each other in the embedding space. Since the graphs of a dataset lie in a highly discrete space, it is hard to seek out the correlated counterpart of each graph within the dataset. To tackle this limitation, we propose to construct pairs of correlated graphs via the attribute masking strategy (Hu et al., 2019), which randomly masks a part of the node/edge attributes in a graph (a theoretical analysis is given in Sec. A). Applying this technique to a randomly sampled mini-batch $\mathcal{B}_\mathcal{G} = \{G_j = (\mathcal{V}_j, \mathcal{E}_j)\}_{j=1}^N$ of $N$ graphs yields the correlated counterpart of each graph, which forms another mini-batch $\mathcal{B}_{\mathcal{G}'} = \{G'_j = (\mathcal{V}'_j, \mathcal{E}'_j)\}_{j=1}^N$ ($G_j$ and $G'_j$ are deemed a pair of correlated graphs). Taking both mini-batches as input, the corresponding patch and graph embeddings are derived as follows:

$$h_{\mathcal{V}_j} = \{h_v \,|\, v \in \mathcal{V}_j\} = \mathrm{GNN}(X_{\mathcal{V}_j}, X_{\mathcal{E}_j}), \quad h_{\mathcal{V}'_j} = \{h_{v'} \,|\, v' \in \mathcal{V}'_j\} = \mathrm{GNN}(X_{\mathcal{V}'_j}, X_{\mathcal{E}'_j}), \quad (4)$$

$$h_{G_j} = \mathrm{READOUT}(h_{\mathcal{V}_j}), \quad h_{G'_j} = \mathrm{READOUT}(h_{\mathcal{V}'_j}), \quad (5)$$

where $h_{\mathcal{V}_j}$ ($h_{\mathcal{V}'_j}$) is the set of patch embeddings of graph $G_j$ ($G'_j$), and $h_{G_j}$ ($h_{G'_j}$) denotes the embedding of the entire graph. With these ingredients, we design the learning objective for local-instance structure based on two desiderata: (1) similar subgraphs (i.e. patches) have similar feature representations; (2) graphs with a set of similar patches are embedded close to each other. To attain these goals, we propose to maximize the mutual information (i.e. minimize the InfoNCE loss) between correlated patches/graphs, which yields two constraints for the local-instance structure:

$$\mathcal{L}_{\mathrm{patch}} = \frac{1}{\sum_{j=1}^N |\mathcal{V}'_j|} \sum_{j=1}^N \sum_{v' \in \mathcal{V}'_j} \sum_{v \in \mathcal{V}_j} \mathbb{1}_{v \leftrightarrow v'} \cdot \mathcal{L}_{\mathrm{NCE}}\big(h_{v'}, h_v, \{h_{\tilde{v}} \,|\, \tilde{v} \in \mathcal{V}_j, \tilde{v} \neq v\}\big), \quad (6)$$

$$\mathcal{L}_{\mathrm{graph}} = \frac{1}{N} \sum_{j=1}^N \mathcal{L}_{\mathrm{NCE}}\big(h_{G'_j}, h_{G_j}, \{h_{G_k} \,|\, 1 \le k \le N, k \neq j\}\big), \quad (7)$$

$$\mathcal{L}_{\mathrm{local}} = \mathcal{L}_{\mathrm{patch}} + \mathcal{L}_{\mathrm{graph}}, \quad (8)$$

where $\mathcal{L}_{\mathrm{NCE}}(\cdot, \cdot, \cdot)$ is the InfoNCE loss defined in Eq. 3, and $\mathbb{1}_{v \leftrightarrow v'}$ is the indicator function judging whether $v$ and $v'$ are corresponding nodes in a pair of correlated graphs. Note that masking node/edge attributes does not change the topology of a graph, which makes it easy to determine these corresponding nodes in our method.
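To make the ingredients of $\mathcal{L}_{\mathrm{local}}$ concrete, the sketch below implements the InfoNCE loss of Eq. 3 (with cosine similarity standing in for the discriminator $T$) and the attribute-masking construction of a correlated graph in plain numpy; the mask token value and flat attribute-array layout are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

MASK_TOKEN = -1  # assumed placeholder value for a masked attribute

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def info_nce(q, z_pos, negatives):
    """L_NCE(q, z+, {z_i}) of Eq. 3: score the positive above K distractors."""
    pos = np.exp(cosine(q, z_pos))
    neg = sum(np.exp(cosine(q, z)) for z in negatives)
    return float(-np.log(pos / (pos + neg)))

def attr_mask(node_attrs, ratio=0.3, seed=0):
    """Randomly mask a fraction of node attributes; the topology is untouched,
    so node v of the original graph still corresponds to node v here."""
    rng = np.random.default_rng(seed)
    masked = node_attrs.copy()
    idx = rng.choice(len(node_attrs), max(1, int(ratio * len(node_attrs))), replace=False)
    masked[idx] = MASK_TOKEN
    return masked, idx
```

As intended by Eqs. 6-7, `info_nce` is small when the query is most similar to its positive (correlated) counterpart and large when a distractor scores higher.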

3.2. LEARNING GLOBAL-SEMANTIC STRUCTURE OF GRAPH REPRESENTATIONS

It is worth noticing that the graphs in a dataset may possess hierarchical semantic information. For example, drugs (i.e. molecular graphs) are represented by a five-level hierarchy in the Anatomical Therapeutic Chemical (ATC) classification system (Chen et al., 2012). Moreover, the biological functions of proteins (i.e. graphs of amino acid residues) can be organized in a hierarchical structure (e.g. the Gene Ontology (Ashburner et al., 2000) and FunCat (Ruepp et al., 2004) protein functional-definition schemes). Motivated by this fact, we propose the notion of hierarchical prototypes to describe the distribution of graph embeddings. These prototypes are structured as a set of trees (Fig. 1(b)), in which each node denotes a prototype (i.e. a representative embedding of a feature cluster) and corresponds to a unique parent node unless it is at the top layer. Formally, the hierarchical prototypes can be represented as $\{c_i^l\}_{i=1}^{M_l}$ $(l = 1, 2, \cdots, L_p)$, where $L_p$ denotes the depth of the hierarchical prototypes, and $M_l$ is the number of prototypes at the $l$-th layer. Except for the leaf nodes, each prototype possesses a set of child nodes, denoted as $\mathcal{C}(c_i^l)$ $(1 \le i \le M_l,\; l = 1, 2, \cdots, L_p - 1)$. During training, the derivation of these variables is divided into two stages, i.e. initialization and maintenance.

Initialization of hierarchical prototypes. In order to establish appropriate priors on the graph embeddings, we first pre-train the GNN by minimizing $\mathcal{L}_{\mathrm{local}}$ for one epoch and use it to extract the embeddings of all graphs in the training set, denoted as $\{h_{G_i}\}_{i=1}^{N_D}$ ($N_D$ is the size of the training set). These embeddings are used to initialize the bottom-layer prototypes (i.e. $\{c_i^{L_p}\}_{i=1}^{M_{L_p}}$) via the RPCL-based clustering algorithm (Sec. 2.2):

$$\{c_i^{L_p}\}_{i=1}^{M_{L_p}} = \mathrm{RPCL}\big(\{h_{G_i}\}_{i=1}^{N_D}\big), \quad (9)$$

where $\mathrm{RPCL}(\cdot)$ outputs the cluster centers that are assigned at least one sample.
After that, the prototypes of upper layers are initialized by iteratively applying RPCL-based clustering to the prototypes of the layer below:

$$\{c_i^l\}_{i=1}^{M_l} = \mathrm{RPCL}\big(\{c_i^{l+1}\}_{i=1}^{M_{l+1}}\big), \quad l = 1, 2, \cdots, L_p - 1. \quad (10)$$

It is noteworthy that, in this initialization scheme, the number of prototypes automatically adapts to the distribution of graph embeddings. As a result, the scheme is nonparametric and can adapt to different datasets without prior knowledge about them.

Algorithm 1: Training procedure of GraphLoG
Input: unlabeled dataset $\mathcal{D}$, GNN parameters $\theta_{\mathrm{GNN}}$, discriminator parameters $\theta_T$, hierarchical prototypes $\{c_i^l\}_{i=1}^{M_l}$ $(l = 1, 2, \cdots, L_p)$
for $t = 1$ to $N_T$ do
    $\mathcal{B}_\mathcal{G} \leftarrow \mathrm{RandomSample}(\mathcal{D})$  # get a mini-batch of graphs
    $\mathcal{B}_{\mathcal{G}'} \leftarrow \mathrm{AttrMasking}(\mathcal{B}_\mathcal{G})$  # get the correlated graphs
    $h_{\mathcal{V}_j}, h_{\mathcal{V}'_j}, h_{G_j}, h_{G'_j} \leftarrow$ Eqs. (4, 5) $(j = 1, 2, \cdots, N)$  # extract patch and graph embeddings
    $\mathcal{L}_{\mathrm{local}}, \mathcal{L}_{\mathrm{global}} \leftarrow$ Eqs. (8, 13)  # compute losses
    $\theta_{\mathrm{GNN}} \mathrel{+}= -\nabla_{\theta_{\mathrm{GNN}}}(\mathcal{L}_{\mathrm{local}} + \mathcal{L}_{\mathrm{global}})$  # update GNN's parameters
    $\theta_T \mathrel{+}= -\nabla_{\theta_T}(\mathcal{L}_{\mathrm{local}} + \mathcal{L}_{\mathrm{global}})$  # update discriminator's parameters
    $\{c_i^l\}_{i=1}^{M_l} \leftarrow$ Eqs. (11, 12) $(l = 1, 2, \cdots, L_p)$  # maintain hierarchical prototypes
end for

Maintenance of hierarchical prototypes. In each training iteration, the graph embeddings in the mini-batch are divided into $M_{L_p}$ groups according to their most similar bottom-layer prototype, and the mean graph embedding is computed within each group, denoted as $\{\bar{c}_i^{L_p}\}_{i=1}^{M_{L_p}}$. These mean embeddings are employed to update the bottom-layer prototypes via an exponential moving average scheme:

$$c_i^{L_p} \leftarrow \beta c_i^{L_p} + (1 - \beta)\,\bar{c}_i^{L_p}, \quad 1 \le i \le M_{L_p}, \quad (11)$$

where $\beta$ is the exponential decay rate. The prototypes of upper layers are updated with the mean of their child prototypes in the corresponding tree:

$$c_i^l \leftarrow \frac{1}{|\mathcal{C}(c_i^l)|} \sum_{c_k^{l+1} \in \mathcal{C}(c_i^l)} c_k^{l+1}, \quad 1 \le i \le M_l, \; l = 1, 2, \cdots, L_p - 1. \quad (12)$$

Constraint for global-semantic structure. Now that the latent semantic structure of the data is represented by hierarchical prototypes, we seek to constrain the distribution of graph embeddings with these prototypes. The major goal here is to map correlated graphs to the same set of feature clusters.
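The maintenance step of Eqs. 11-12 reduces to two simple averaging updates; a minimal numpy sketch (assuming flat prototype arrays per layer and an explicit child-index list, which is our illustrative data layout) could look like:

```python
import numpy as np

def update_bottom(prototypes, group_means, beta=0.95):
    """Exponential moving average of the bottom-layer prototypes (Eq. 11):
    c <- beta * c + (1 - beta) * mean embedding of the assigned group."""
    return beta * prototypes + (1.0 - beta) * group_means

def update_upper(children, child_index):
    """Each upper-layer prototype is the mean of its child prototypes (Eq. 12).
    children: (M_{l+1}, d) array; child_index[i]: indices of prototype i's children."""
    return np.stack([children[idx].mean(axis=0) for idx in child_index])
```

With β = 0.95, the prototypes drift slowly toward the current mini-batch statistics, which stabilizes the discovered clusters across iterations.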
In practice, according to cosine similarity, we first search for the prototype most similar to the embedding of graph $G_j$ in each layer, denoted as $s(G_j) = \{s_1(G_j), s_2(G_j), \cdots, s_{L_p}(G_j)\}$. Note that this search process follows the topology of the hierarchical prototypes, which means that $s_{l+1}(G_j) \in \mathcal{C}(s_l(G_j))$ $(l = 1, 2, \cdots, L_p - 1)$. Correspondingly, when the embedding of the correlated graph $G'_j$ is used for search, we expect an identical search path, and this objective is pursued by maximizing the mutual information (i.e. minimizing the InfoNCE loss) between the graph embedding $h_{G'_j}$ and the prototypes in $s(G_j)$:

$$\mathcal{L}_{\mathrm{global}} = \frac{1}{N \cdot L_p} \sum_{j=1}^N \sum_{l=1}^{L_p} \mathcal{L}_{\mathrm{NCE}}\big(h_{G'_j}, s_l(G_j), \{c_i^l \,|\, 1 \le i \le M_l, c_i^l \neq s_l(G_j)\}\big). \quad (13)$$

Discussion. A recent work (Li et al., 2020) employed hierarchical prototypes for visual representation learning. The semantic hierarchy established in that work is derived from multiple rounds of clustering with different numbers of clusters, which relies on heuristically selected cluster numbers and fails to model the relations between prototypes from different hierarchies. In contrast, our method is free from pre-defined cluster numbers, and a set of relational trees is constructed to embody the hierarchical relations between different prototypes.
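The top-down search for $s(G_j)$ can be sketched as follows; the `layers`/`children` encoding of the prototype trees is an assumption made for illustration. At the top layer the search ranges over all prototypes, and at every layer below it is restricted to the children of the previous pick, so the returned path respects the tree topology.

```python
import numpy as np

def search_path(h, layers, children):
    """Top-down prototype search of Sec. 3.2.
    layers[l]: (M_l, d) array of prototypes at layer l (top layer first);
    children[l][i]: indices into layers[l+1] of the children of prototype i."""
    def cos(a, B):
        # cosine similarity of vector a against every row of B
        return (B @ a) / (np.linalg.norm(B, axis=1) * np.linalg.norm(a) + 1e-8)
    path = [int(np.argmax(cos(h, layers[0])))]           # best top-layer prototype
    for l in range(len(layers) - 1):
        cand = children[l][path[-1]]                     # only the children of the last pick
        path.append(int(cand[np.argmax(cos(h, layers[l + 1][cand]))]))
    return path
```

The path returned for $h_{G_j}$ then supplies the positives of Eq. 13, with the remaining prototypes of each layer acting as distractors.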

3.3. MODEL OPTIMIZATION

The training procedure of Local-instance and Global-semantic Learning (GraphLoG) is summarized in Algorithm 1. During training, the constraints for local-instance and global-semantic structure are derived from the online-updated graph embeddings, and the hierarchical prototypes are maintained accordingly. In each iteration, the parameters of the GNN and the discriminator are optimized with gradient descent using the following objective:

$$\min_{\theta_{\mathrm{GNN}},\, \theta_T} \; \mathcal{L}_{\mathrm{local}} + \mathcal{L}_{\mathrm{global}}. \quad (14)$$

4. SUP-GRAPHLOG: A SUPERVISED BASELINE FOR LOCAL-INSTANCE AND GLOBAL-SEMANTIC LEARNING

In order to verify the effectiveness of local-instance and global-semantic learning when it is directly applied to supervised downstream tasks, we propose a baseline model, named sup-GraphLoG, which combines a plain GNN with the proposed hierarchical prototypes.

In the training phase, in order to establish an appropriate local-instance structure of graph embeddings, the GNN is first pre-trained along with a linear classifier to perform graph classification on the training set. For the initialization of hierarchical prototypes, the number of bottom-layer prototypes is set to the class number of the supervised task (e.g. $2K_T$ bottom-layer prototypes for a task with $K_T$ binary classification problems), and each bottom-layer prototype is the mean embedding of all the training graphs belonging to the corresponding class. The upper-layer prototypes are initialized as in Eq. 10. For maintenance, given a mini-batch of labeled graphs, each bottom-layer prototype is updated with the mean embedding of all the graphs belonging to the corresponding class using an exponential moving average scheme as in Eq. 11, and the prototypes of upper layers are maintained following Eq. 12.

For constraining the global-semantic structure, in contrast with the top-down search in the self-supervised model (Sec. 3.2), in this supervised setting we first use the label of graph $G_j$ to randomly select a matched bottom-layer prototype $s_{L_p}(G_j)$ and then obtain the whole search path $s(G_j) = \{s_1(G_j), s_2(G_j), \cdots, s_{L_p}(G_j)\}$ bottom-up. Based on this positive search path, we randomly sample a negative path $s^n(G_j) = \{s^n_1(G_j), s^n_2(G_j), \cdots, s^n_{L_p}(G_j)\}$ satisfying that graph $G_j$ does not belong to the class corresponding to $s^n_{L_p}(G_j)$, and $s^n_l(G_j) \neq s_l(G_j)$, $s^n_{l+1}(G_j) \in \mathcal{C}(s^n_l(G_j))$ $(l = 1, 2, \cdots, L_p - 1)$.
It is expected that the embedding of graph $G_j$ is more similar to the prototypes on path $s(G_j)$ than to the ones on path $s^n(G_j)$, which defines the loss on a mini-batch as follows:

$$\mathcal{L}^{\mathrm{sup}}_{\mathrm{global}} = \frac{1}{N \cdot L_p} \sum_{j=1}^N \sum_{l=1}^{L_p} \mathcal{L}_{\mathrm{NCE}}\big(h_{G_j}, s_l(G_j), s^n_l(G_j)\big). \quad (15)$$

We further optimize the GNN by minimizing this loss, which refines the global-semantic structure in the embedding space.

In the inference phase, given an unlabeled graph, we first compute the similarity between its embedding and all the bottom-layer prototypes via the cosine similarity function. After that, the task-specific prediction is derived by comparing the similarity scores of the classes corresponding to these prototypes, where the classes with larger scores serve as the prediction result.
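The inference rule above can be sketched as below for $K_T$ binary tasks, assuming the $2K_T$ bottom-layer prototypes are stored as (negative-class, positive-class) pairs per task; this storage layout is our illustrative convention, not taken from the paper.

```python
import numpy as np

def predict(h, bottom_prototypes):
    """bottom_prototypes: (K_T, 2, d) array of (negative, positive) class
    prototypes per binary task. Return the 0/1 prediction for each task by
    comparing cosine similarities of the graph embedding h to the two."""
    def cos(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))
    return [int(cos(h, pair[1]) > cos(h, pair[0])) for pair in bottom_prototypes]
```

Each task is decided independently, so a single forward pass of the GNN plus $2K_T$ similarity computations yields the full multi-task prediction.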

5. RELATED WORK

Graph Neural Networks (GNNs). Recently, following the efforts of learning graph representations via optimizing random-walk (Perozzi et al., 2014; Tang et al., 2015; Grover & Leskovec, 2016; Narayanan et al., 2017) or matrix-factorization (Cao et al., 2015; Wang et al., 2016) objectives, GNNs were proposed to explicitly derive proximity-preserving feature vectors in a neighborhood-aggregation manner. As suggested in Gilmer et al. (2017), the forward pass of most GNNs can be described in two phases, a Message Passing phase and a Readout phase, and various works (Kipf & Welling, 2017; Hamilton et al., 2017; Velickovic et al., 2018; Ying et al., 2018; Zhang et al., 2018; Xu et al., 2019) sought to improve the effectiveness of these two phases. Unlike these methods, which are mainly trained in a supervised fashion, our approach aims at unsupervised/self-supervised learning for GNNs.

Self-supervised Learning for GNNs. Several recent works learned graph representations in a self-supervised manner through mutual information maximization (Velickovic et al., 2019; Sun et al., 2019; Hassani & Ahmadi, 2020). Also, some self-supervised tasks, e.g. edge prediction (Kipf & Welling, 2016), context prediction (Hu et al., 2019; Rong et al., 2020a), graph partitioning (You et al., 2020) and edge/attribute generation (Hu et al., 2020), have been designed to acquire knowledge from unlabeled graphs. Nevertheless, all these methods are only able to model the local relations between different graph instances. The proposed framework seeks to discover both the local-instance and global-semantic structure of a set of graphs.

Self-supervised Semantic Learning. Clustering-based methods (Xie et al., 2016; Yang et al., 2016; 2017; Caron et al., 2018; Ji et al., 2019; Li et al., 2020) are commonly used to learn the semantic information of the data in a self-supervised fashion. Among them, DeepCluster (Caron et al., 2018) demonstrated the strong transferability of visual representations learned by clustering prediction to various downstream visual tasks, and Prototypical Contrastive Learning (Li et al., 2020) set a new state-of-the-art for unsupervised visual representation learning. These methods are mainly developed for images rather than graph-structured data. Furthermore, the hierarchical semantic structure of the data has been less explored in previous works.

6.1. EXPERIMENTAL SETUP

Pre-training details. Following Hu et al. (2019), we adopt a five-layer Graph Isomorphism Network (GIN) (Xu et al., 2019) with 300-dimensional hidden units and a mean-pooling readout function for performance comparisons (Secs. 6.2 and 6.3). The discriminator for mutual information estimation is formalized as $T(x_1, x_2) = g(f(x_1), f(x_2))$, where $f(\cdot)$ is a projection function implemented as two linear layers with a ReLU nonlinearity between them, and $g(\cdot, \cdot)$ is a similarity function (cosine similarity in our method). In all experiments, we use an Adam optimizer (Kingma & Ba, 2015) (learning rate: $1 \times 10^{-3}$, batch size: 512) to train the model for 20 epochs. Unless otherwise specified, the depth of the hierarchical prototypes $L_p$ is set to 3, and the exponential decay rate $\beta$ is set to 0.95. For attribute masking, 30% of the node attributes in molecular graphs are masked, and 30% of the edge attributes in Protein-Protein Interaction (PPI) networks are masked. These hyperparameters are selected by grid search on the validation sets of four downstream molecule datasets (i.e. BBBP, SIDER, ClinTox and BACE), and their sensitivity is analyzed in Secs. 6.4 and F.

Fine-tuning details. For fine-tuning on a downstream task, a linear classifier is appended on top of the pre-trained GNN, and an Adam optimizer (classifier's learning rate: $1 \times 10^{-3}$, GNN's learning rate: $1 \times 10^{-4}$, batch size: 32) is employed to train the model for 100 epochs. For sup-GraphLoG, the GNN is first trained along with a linear classifier for 50 epochs using an Adam optimizer (learning rate: $1 \times 10^{-3}$, batch size: 32), and it is then fine-tuned under the guidance of hierarchical prototypes by an Adam optimizer (learning rate: $1 \times 10^{-4}$, batch size: 32). All reported results are averaged over five independent runs under the same configuration. Our approach is implemented with PyTorch (Paszke et al., 2017), and the source code will be released for reproducibility.

Performance comparison.
We compare the proposed method with existing self-supervised graph representation learning algorithms (i.e. EdgePred (Kipf & Welling, 2016), InfoGraph (Sun et al., 2019), AttrMasking (Hu et al., 2019), ContextPred (Hu et al., 2019) and GraphPartition (You et al., 2020)) to verify its effectiveness. Following the setting of Hu et al. (2019), after pre-training GNN models with the self-supervised methods, a graph-level multi-task supervised pre-training is conducted to obtain more transferable graph representations, and the performance on downstream tasks is evaluated both before and after this graph-level supervised pre-training.
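As a concrete illustration of the discriminator $T(x_1, x_2) = g(f(x_1), f(x_2))$ described above, here is a numpy sketch with random placeholder weights; the dimensions and initialization are assumptions for illustration, not the trained parameters.

```python
import numpy as np

def make_projection(d_in, d_out, seed=0):
    """The projection f: two linear layers with a ReLU in between
    (biases omitted for brevity; weights are random placeholders)."""
    rng = np.random.default_rng(seed)
    W1 = rng.normal(size=(d_in, d_out)) / np.sqrt(d_in)
    W2 = rng.normal(size=(d_out, d_out)) / np.sqrt(d_out)
    def f(x):
        return np.maximum(x @ W1, 0.0) @ W2
    return f

def discriminator(f, x1, x2):
    """T(x1, x2) = g(f(x1), f(x2)) with g = cosine similarity."""
    z1, z2 = f(x1), f(x2)
    return float(z1 @ z2 / (np.linalg.norm(z1) * np.linalg.norm(z2) + 1e-8))
```

Since $g$ is cosine similarity, the discriminator output is bounded in [-1, 1] and symmetric in its two arguments, which keeps the InfoNCE logits well scaled.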

6.2. EXPERIMENTS ON CHEMISTRY DOMAIN

Datasets. For a fair comparison, we use the same datasets as Hu et al. (2019). Specifically, a subset of the ZINC15 database (Sterling & Irwin, 2015) with 2 million unlabeled molecules is employed for self-supervised pre-training, and a preprocessed ChEMBL dataset (Mayr et al., 2018) with 456K labeled molecules is used for graph-level supervised pre-training. Eight binary classification datasets from MoleculeNet (Wu et al., 2018) serve as downstream tasks.

Results. Tab. 1 reports the performance of the proposed GraphLoG method compared with other works. Among all self-supervised learning strategies, our approach achieves the best performance on seven of the eight tasks, and a 3% performance gain is obtained in terms of average ROC-AUC. After applying a subsequent graph-level supervised pre-training, our models' performance is further improved; in particular, a 2.9% increase is observed on the SIDER dataset. The table also compares two supervised methods without self-supervised pre-training: the proposed sup-GraphLoG outperforms the vanilla GIN model with random initialization, which demonstrates the benefit of learning global-semantic structure. The training curves of the eight downstream tasks are provided in Sec. G.
Table 1: Test ROC-AUC (%) on the eight MoleculeNet downstream tasks (one column per dataset; last column: average).

… | … | … | … | … | … | … | …3 ± 1.9 | 70.1 ± 5.4 | 67.0
sup-GraphLoG (ours) | 71.1 ± 0.3 | 72.9 ± 0.2 | 63.8 ± 0.1 | 61.4 ± 0.6 | 64.0 ± 0.6 | 72.5 ± 1.0 | 76.7 ± 0.5 | 76.5 ± 1.1 | 69.9
EdgePred (2016) | 67.3 ± 2.4 | 76.0 ± 0.6 | 64.1 ± 0.6 | 60.4 ± 0.7 | 64.1 ± 3.7 | 74.1 ± 2.1 | 76.3 ± 1.0 | 79.9 ± 0.9 | 70.3
InfoGraph (2019) | 68.2 ± 0.7 | 75.5 ± 0.6 | 63.1 ± 0.3 | 59.4 ± 1.0 | 70.5 ± 1.8 | 75.6 ± 1.2 | 77.6 ± 0.4 | 78.9 ± 1.1 | 71.1
AttrMasking (2019) | 64.3 ± 2.8 | 76.7 ± 0.4 | 64.2 ± 0.5 | 61.0 ± 0.7 | 71.8 ± 4.1 | 74.7 ± 1.4 | 77.2 ± 1.1 | 79.3 ± 1.6 | 71.1
ContextPred (2019) | 68.0 ± 2.0 | 75.7 ± 0.7 | 63.9 ± 0.6 | 60.9 ± 0.6 | 65.9 ± 3.8 | 75.8 ± 1.7 | 77.3 ± 1.0 | 79.6 ± 1.2 | 70.9
GraphPartition (2020) | 70.3 ± 0.7 | 75.2 ± 0.4 | 63.2 ± 0.3 | 61.0 ± 0.8 | 64.2 ± 0.5 | 75.4 ± 1.7 | 77.1 ± 0.7 | 79.6 ± 1.8 | 70.8
GraphLoG (ours) | 73.9 ± 0.7 | 76.2 ± 0.2 | 64.2 ± 0.5 | 61.7 ± 1.2 | 78.6 ± 1.5 | 76.4 ± 1.0 | 78.2 ± 0.6 | 83.3 ± 1.4 | 74.1
Supervised | 68.3 ± 0.7 | 77.0 ± 0.3 | 64.4 ± 0.4 | 62.1 ± 0.5 | 57.2 ± 2.5 | 79.4 ± 1.3 | 74.4 ± 1.2 | 76.0 ± 1.0 | 70.0
EdgePred* (2016) | 66.6 ± 2.2 | 78.3 ± 0.3 | 66.5 ± 0.3 | 63.3 ± 0.9 | 70.9 ± 4.6 | 78.5 ± 2.4 | 77.5 ± 0.8 | 79.1 ± 3.7 | 72.6
InfoGraph* (2019) | 68.4 ± 1.0 | 77.6 ± 0.7 | 65.3 ± 0.4 | 62.5 ± 0.7 | 73.8 ± 1.9 | 79.3 ± 1.6 | 78.0 ± 1.1 | 82.4 ± 1.3 | 73.4
AttrMasking* (2019) | 66.5 ± 2.5 | 77.9 ± 0.4 | 65.1 ± 0.3 | 63.9 ± 0.9 | 73.7 ± 2.8 | 81.2 ± 1.9 | 77.1 ± 1.2 | 80.3 ± 0.9 | 73.2
ContextPred* (2019) | 68.7 ± 1.3 | 78.1 ± 0.6 | 65.7 ± 0.6 | 62.7 ± 0.8 | 72.6 ± 1.5 | 81.3 ± 2.1 | 79.9 ± 0.7 | 84.5 ± 0.7 | 74.2
GraphPartition* (2020) | 71.1 ± 0.5 | 77.4 ± 0.4 | 64.2 ± 0.1 | 63.4 ± 0.2 | 72.9 ± 0.4 | 78.2 ± 0.7 | 78.6 ± 0.4 | 80.4 ± 0.2 | 73.3
GraphLoG* (ours) | 74.0 ± 0.8 | 78.5 ± 0.2 | 66.5 ± 0.5 | 64.6 ± 0.8 | 78.6 ± 0.7 | 79.5 ± 1.3 | 80.1 ± 0.7 | 85.1 ± 0.9 | 75.9

"*" denotes the model composed of a specific self-supervised pre-training and a subsequent graph-level supervised pre-training.
6.3. EXPERIMENTS ON BIOLOGY DOMAIN

Table 2(a): Test ROC-AUC (%) on the biological function prediction task.

… | 67.6 ± 0.8
EdgePred (Kipf & Welling, 2016) | 70.5 ± 0.7
InfoGraph (Sun et al., 2019) | 70.7 ± 0.5
AttrMasking (Hu et al., 2019) | 70.5 ± 0.5
ContextPred (Hu et al., 2019) | 69.9 ± 0.3
GraphPartition (You et al., 2020) | 71.0 ± 0.2
GraphLoG (ours) | 72.8 ± 0.4

(b) Ablation study for different loss terms.

Results. In Tab. 2a, we report the test ROC-AUC of various self-supervised learning techniques; more results on the biology domain can be found in Sec. C. It can be observed that the proposed GraphLoG method outperforms existing approaches by a clear margin, i.e. a 1.8% performance gain. This result illustrates that the proposed scheme is beneficial to fine-grained downstream tasks. In addition, sup-GraphLoG promotes the performance of a plain GIN model by 2.8% on this biological downstream task.

6.4. ANALYSIS

Effect of different loss terms. In Tab. 2b, we analyze the effect of the three loss terms on biological function prediction, continuing to use the GIN described in Sec. 6.1. When each loss is applied independently (1st, 2nd and 3rd rows), the loss for global-semantic structure performs best, which probably benefits from its exploration of the data's semantic information. By combining these losses, the full model (last row) achieves the best performance, which illustrates that the learning of local-instance and global-semantic structure are complementary to each other. We provide more ablation studies on different model components in Sec. E.

Results on different GNNs. Fig. 2(a) presents the effect of self-supervised pre-training on four kinds of GNNs: GCN (Kipf & Welling, 2017), GraphSAGE (Hamilton et al., 2017), GAT (Velickovic et al., 2018) and GIN (Xu et al., 2019). We can observe that the proposed GraphLoG scheme outperforms two existing methods, AttrMasking and ContextPred, on all configurations, and, on GAT, it avoids the performance decrease relative to the random-initialization baseline.

Sensitivity of the exponential decay rate β. In this experiment, we evaluate our approach's sensitivity to the parameter β. Fig. 2(c) shows the test ROC-AUC on the downstream task for different β values. From the line chart, we can observe that the model's performance is not sensitive to β, which makes the maintenance scheme of hierarchical prototypes easy to tune.

Visualization. In Fig. 3, we use t-SNE (Maaten & Hinton, 2008) to visualize the distributions of graph embeddings and hierarchical prototypes on the ZINC15 dataset. Compared to the model with only the $\mathcal{L}_{\mathrm{patch}}$ constraint, some feature clusters are formed after constraining the relations between correlated graphs' embeddings with $\mathcal{L}_{\mathrm{graph}}$. More obvious feature separation is achieved after applying $\mathcal{L}_{\mathrm{global}}$, which illustrates its effectiveness in discovering the global-semantic structure of the data.

7. CONCLUSIONS AND FUTURE WORK

We devise a unified framework called Local-instance and Global-semantic Learning (GraphLoG) for self-supervised graph representation learning, which models the structure of a set of unlabeled graphs both locally and globally. In this framework, we propose the novel strategy of learning hierarchical prototypes upon graph embeddings to infer the global-semantic structure of the data. Using benchmark datasets from both the chemistry and biology domains, we empirically verify our method's superior performance over state-of-the-art approaches on different GNN architectures. Future work includes exploring novel ways of constructing correlated graphs, improving self-supervised learning strategies, unifying pre-training and fine-tuning, and extending our framework to other domains such as sociology, physics and material science.

A THEORETICAL ANALYSIS OF CORRELATED GRAPH CONSTRUCTION

In our method, we choose the attribute masking (Hu et al., 2019) strategy to generate correlated graph pairs, which is widely used in recent self-supervised graph representation learning algorithms (Hu et al., 2019; 2020; Qiu et al., 2020). Since the graph structure is not changed by the masking operation, the masked node attribute information can be partially recovered from its surrounding neighbors after being fed into a GNN. Therefore, the embeddings of a correlated graph pair can maintain a high degree of consistency in the feature space, which is desirable for the proposed GraphLoG model. We formally elucidate this point as follows.

Given an attributed graph G = (V, E, X_V, X_E) (X_V: node attributes, X_E: edge attributes), we assume that its correlated graph G' = (V, E, X'_V, X_E) is obtained by masking the attribute of a node v, i.e. X'_V = X_{V∖{v}} ∪ {X_v^m}.

Proposition 1. An L-layer GNN can repair the information lost through the attribute masking operation by I_repair ≥ I(X_v, {X_ṽ | ṽ ∈ N_v^L}), where I(·,·) denotes the mutual information, and N_v^L is the L-hop neighborhood set of v.

Proof. Before information propagation by the GNN, we define the lost information I_lost induced by attribute masking as the conditional entropy of graph G conditioned on its correlated graph G':

    I_lost = H(G | G') = H(X_v).    (16)

After the information propagation by the GNN, we have:

    Ĩ_lost = H(h_G | h_G') = H(h_G) − I(h_G, h_G'),    (17)

where Ĩ_lost is the information lost in the embedding of the correlated graph G' compared with the embedding of the original graph G. According to the neighborhood aggregation scheme of the GNN, we can derive:

    I(h_G, h_G') = H(h_G) − H(h_v) + I(X_v, {X_ṽ | ṽ ∈ N_v^L}),    (18)

where I(X_v, {X_ṽ | ṽ ∈ N_v^L}) denotes the information about the masked node v recovered from its L-hop neighbors. Combining Eqs. 17 and 18 leads to:

    Ĩ_lost = H(h_v) − I(X_v, {X_ṽ | ṽ ∈ N_v^L}).    (19)

We can then derive the information repaired by the GNN, I_repair, as:

    I_repair = I_lost − Ĩ_lost = I(X_v, {X_ṽ | ṽ ∈ N_v^L}) + H(X_v) − H(h_v).    (20)

Since h_v is the low-dimensional embedding of the node attribute X_v, we can deduce that:

    H(X_v) ≥ H(h_v).    (21)

Therefore, combining Eqs. 20 and 21, we can conclude:

    I_repair ≥ I(X_v, {X_ṽ | ṽ ∈ N_v^L}).    (22)

We also evaluate the GraphLoG model with different correlated graph construction strategies; the experimental results can be found in Sec. E.1. They empirically show that attribute masking is the most reliable strategy for our method.
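The inequality in Eq. 21 holds because a deterministic mapping of a random variable cannot increase its entropy. A toy numerical check (with discrete stand-ins for X_v and h_v, chosen only for illustration) is:

```python
# Toy check of Eq. 21: a deterministic embedding cannot increase entropy.
# X takes 4 values uniformly; h = f(X) collapses them to 2 values.
import math
from collections import Counter

def entropy(samples):
    """Empirical Shannon entropy (in bits) of a list of discrete samples."""
    counts = Counter(samples)
    n = len(samples)
    return -sum(c / n * math.log2(c / n) for c in counts.values())

X = [0, 1, 2, 3] * 250      # uniform over 4 symbols -> H(X) = 2 bits
h = [x // 2 for x in X]     # deterministic "embedding" -> H(h) = 1 bit

assert entropy(X) >= entropy(h)
print(entropy(X), entropy(h))   # 2.0 1.0
```

The same data-processing argument underlies the claim that h_v, being a function of X_v, satisfies H(X_v) ≥ H(h_v).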

B MORE IMPLEMENTATION DETAILS

Attribute masking strategy. We add an extra dimension to the vector of node/edge attributes and set only that dimension to 1 when the corresponding node/edge is masked. Given a mini-batch of graphs, we mask the same proportion of node/edge attributes in each graph, and, for undirected graphs, the attributes on both directions of an edge are masked/unmasked together.

GNN architecture. All the GNNs in our experiments (i.e. GCN (Kipf & Welling, 2017), GraphSAGE (Hamilton et al., 2017), GAT (Velickovic et al., 2018) and GIN (Xu et al., 2019)) have 5 layers, 300-dimensional hidden units and a mean pooling readout function. In addition, two attention heads are employed in each layer of the GAT model.

C MORE RESULTS ON BIOLOGY DOMAIN

In Tab. 3, we report the performance of different approaches on the downstream task of the biology domain; the results before and after applying a subsequent graph-level supervised pre-training are reported respectively. It can be observed that the proposed GraphLoG method outperforms existing approaches by a clear margin under both settings, which illustrates the effectiveness of the proposed learning scheme both with and without the guidance of a graph-level supervisory signal.

Table 3: Test ROC-AUC (%) on the biological function prediction benchmark. "*" denotes the model composed of a specific self-supervised pre-training and a subsequent graph-level supervised pre-training.

    EdgePred (Kipf & Welling, 2016)       70.5 ± 0.7    EdgePred*         73.1 ± 0.5
    InfoGraph (Sun et al., 2019)          70.7 ± 0.5    InfoGraph*        73.7 ± 0.4
    AttrMasking (Hu et al., 2019)         70.5 ± 0.5    AttrMasking*      74.2 ± 1.5
    ContextPred (Hu et al., 2019)         69.9 ± 0.3    ContextPred*      74.3 ± 0.6
    GraphPartition (You et al., 2020)     71.0 ± 0.2    GraphPartition*   73.5 ± 0.1
    GraphLoG (ours)                       72.8 ± 0.4    GraphLoG* (ours)  75.7 ± 0.6

Table 4: The 10-fold cross validation accuracy (mean ± std %) of self-supervised methods on graph classification benchmarks (columns: MUTAG, PTC, IMDB-Binary, IMDB-Multi, Reddit-Binary).

    random walk (Gärtner et al., 2003)      83.7 ± 1.5    57.9 ± 1.3    50.7 ± 0.3    34.7 ± 0.2    –
    node2vec (Grover & Leskovec, 2016)      72.6 ± 10.2   58.6 ± 8.0    –             –             –
    graph2vec (Narayanan et al., 2017)      83.2 ± 9.6    60.2 ± 6.9    71.1 ± 0.5    50.4 ± 0.9    75.8 ± 1.0
    sub2vec (Adhikari et al., 2018)         61.1 ± 15.8   60.0 ± 6.4    55.3 ± 1.5    36.7 ± 0.8    71.5 ± 0.4
    InfoGraph (Sun et al., 2019)            89.0 ± 1.1    61.7 ± 1.4    73.0 ± 0.9    49.7 ± 0.5    82.5 ± 1.4
    Contrastive (Hassani & Ahmadi, 2020)    89.7 ± 1.1    62.5 ± 1.7    74.2 ± 0.7    51.2 ± 0.5    84.5 ± 0.6
    GraphLoG (ours)                         89.9 ± 1.5    63.8 ± 1.6    76.6 ± 4.2    53.0 ± 3.5    85.9 ± 2.9
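The attribute masking strategy described in Sec. B can be sketched as follows. This is a generic NumPy re-implementation of the idea (extra indicator dimension, fixed masking proportion), not the authors' exact code; the function name and shapes are illustrative.

```python
# Minimal sketch of attribute masking with an extra mask-indicator dimension:
# masked rows have their original attributes zeroed and only the appended
# dimension set to 1.
import numpy as np

def mask_node_attributes(x, mask_rate=0.3, rng=None):
    """x: (num_nodes, d) node attributes. Returns a (num_nodes, d+1) copy."""
    rng = rng or np.random.default_rng(0)
    n = x.shape[0]
    out = np.concatenate([x, np.zeros((n, 1))], axis=1)  # extra mask dimension
    idx = rng.choice(n, size=max(1, int(round(mask_rate * n))), replace=False)
    out[idx, :-1] = 0.0   # wipe the original attributes of masked nodes
    out[idx, -1] = 1.0    # set only the indicator dimension to 1
    return out

x = np.ones((10, 4))
masked = mask_node_attributes(x, mask_rate=0.3)
print(masked.shape)               # (10, 5)
print(int(masked[:, -1].sum()))   # 3
```

For undirected graphs, the same index set would be applied to both directions of each masked edge, as noted above.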

D MORE RESULTS ON GRAPH CLASSIFICATION BENCHMARKS

Setups. In this experiment, we compare GraphLoG with six self-supervised graph representation learning methods, i.e. random walk (Gärtner et al., 2003), node2vec (Grover & Leskovec, 2016), graph2vec (Narayanan et al., 2017), sub2vec (Adhikari et al., 2018), InfoGraph (Sun et al., 2019) and Contrastive (Hassani & Ahmadi, 2020). We strictly follow the linear evaluation protocol of Sun et al. (2019) and report the mean accuracy of 10-fold cross validation. Five conventional graph classification benchmark datasets, i.e. MUTAG (Kriege et al., 2016), PTC (Kriege et al., 2016), IMDB-Binary (Yanardag & Vishwanathan, 2015), IMDB-Multi (Yanardag & Vishwanathan, 2015) and Reddit-Binary (Yanardag & Vishwanathan, 2015), are used for evaluation. The settings of the network architecture, optimizer and training parameters follow those in Sec. 6.1.

Results. Tab. 4 presents a comparison of self-supervised approaches on the five graph classification benchmark datasets. The proposed GraphLoG model ranks first on every task and, in particular, outperforms a recent contrastive-learning-based method (Hassani & Ahmadi, 2020), which demonstrates the effectiveness of learning both local-instance and global-semantic structure.
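The linear evaluation protocol above (frozen embeddings scored by a linear classifier under 10-fold cross validation) can be sketched as below. Random features stand in for the real graph embeddings, and the choice of logistic regression as the linear classifier is an assumption for illustration.

```python
# Hedged sketch of a linear evaluation protocol: freeze the embeddings and
# measure 10-fold cross-validation accuracy of a linear classifier on them.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

rng = np.random.default_rng(0)
emb = rng.normal(size=(100, 300))        # placeholder frozen graph embeddings
labels = rng.integers(0, 2, size=100)    # placeholder binary graph labels

cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
scores = cross_val_score(LogisticRegression(max_iter=1000), emb, labels, cv=cv)
print(scores.mean(), scores.std())       # mean ± std accuracy over the 10 folds
```

Reporting mean ± std over the folds matches the format of Tab. 4.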

E MORE ABLATION STUDIES E.1 ABLATION STUDY ON CONSTRUCTING CORRELATED GRAPHS

In this part, we analyze three ways of constructing correlated graphs, i.e. AttrMasking (Hu et al., 2019) (with a 30% attribute masking rate), DropEdge (Rong et al., 2020b) (with 10% of edges dropped) and GraphDiffusion (Klicpera et al., 2019) (with a heat kernel to derive a denser adjacency matrix), and evaluate them under the proposed GraphLoG framework. As shown in the first segment of Tab. 5, the AttrMasking strategy outperforms the other two techniques by a clear margin, which is mainly ascribed to the fact that, compared with dropping or adding edges, masking node attributes preserves the matching degree of correlated graph pairs to a greater extent (see the theoretical analysis in Sec. A).

In our method, we obtain correlated graphs/subgraphs via minor modifications of node/edge attributes. Hierarchical prototypes are representative cluster embeddings organized in a hierarchical way.

Figure 1: Illustration of GraphLoG. (a) Correlated graphs are constrained to be adjacently embedded to pursue the local-instance structure of the data. (b) Hierarchical prototypes are employed to discover and refine the global-semantic structure of the data.

GNNs. There are some recent works that explored self-supervised graph representation learning with GNNs. García-Durán & Niepert (2017) learned graph representations by embedding propagation, and Velickovic et al. (2019), Sun et al. (2019) and Hassani

For the biology domain, following the settings in Hu et al. (2019), 395K unlabeled protein ego-networks are utilized for self-supervised pre-training, and the prediction of 5000 coarse-grained biological functions on 88K labeled protein ego-networks serves as graph-level supervised pre-training. The downstream task is to predict 40 fine-grained biological functions of 8 species.

Figure 2: (a) Experimental results on different GNNs. (b)&(c) Sensitivity analysis of hierarchical prototypes' depth L_p and exponential decay rate β. (All results are reported on the biology domain.)

Figure 3: The t-SNE (Maaten & Hinton, 2008) visualization of graph embeddings and hierarchical prototypes on the ZINC15 database (i.e. the pre-training dataset for the chemistry domain).

Training procedure of Local-instance and Global-semantic Learning (GraphLoG). Input: training set D = {G_j}_{j=1}^{N_D}, the number of training iterations N_T, hierarchical prototypes' depth L_p and exponential decay rate β. Output: the pre-trained GNN.

Test ROC-AUC (%) on molecular property prediction benchmarks.

Performance comparison and ablation study on the biological function prediction benchmark.

Test ROC-AUC (%) on the biological function prediction benchmark.

The 10-fold cross validation accuracy (mean ± std %) of self-supervised methods on graph classification benchmarks.

E.2 ABLATION STUDY ON LOSS FORMAT

We investigate the effect of two loss formats, the InfoNCE loss (van den Oord et al., 2018) and the hinge-loss-based contrastive loss (Hadsell et al., 2006), on our model. The conventional contrastive loss uses one negative sample for each positive pair and takes a hinge loss form, while the InfoNCE loss employs a large number of negative samples and takes the form of a softmax. We modify the loss format in Eqs. 6, 7 and 13 to conduct the comparison. According to the second segment of Tab. 5, the InfoNCE loss marginally improves the model's performance, and both losses achieve superior performance under the GraphLoG framework.
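The two loss formats compared above can be sketched side by side on cosine similarities. This is a toy re-implementation of the general forms from van den Oord et al. (2018) and Hadsell et al. (2006), not the paper's exact Eqs. 6, 7 and 13; the temperature and margin values are illustrative.

```python
# Side-by-side sketch of the two loss formats: softmax-style InfoNCE with many
# negatives vs. a hinge contrastive loss with a single negative.
import numpy as np

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

def info_nce(anchor, positive, negatives, tau=0.1):
    """-log softmax probability of the positive among all candidates."""
    sims = np.array([cosine(anchor, positive)] +
                    [cosine(anchor, n) for n in negatives]) / tau
    return -sims[0] + np.log(np.exp(sims).sum())

def hinge_contrastive(anchor, positive, negative, margin=0.5):
    """Hinge loss: penalize when the negative gets too close to the anchor."""
    return max(0.0, margin - cosine(anchor, positive) + cosine(anchor, negative))

rng = np.random.default_rng(0)
a = rng.normal(size=8)
p = a + 0.1 * rng.normal(size=8)              # a correlated "positive" view
negs = [rng.normal(size=8) for _ in range(16)]  # many random negatives
print(info_nce(a, p, negs), hinge_contrastive(a, p, negs[0]))
```

The InfoNCE form contrasts the positive against all 16 negatives at once, which is what gives it a denser training signal per pair.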

E.3 ABLATION STUDY ON CLUSTERING ALGORITHM

We employ three clustering algorithms to derive hierarchical prototypes in our method and evaluate the corresponding pre-trained models on the downstream task. For K-means, in the prototype initialization stage, we perform clustering hierarchically as in Eqs. 9 and 10 and obtain three hierarchies of prototypes with fixed sizes of M_1 = 10, M_2 = 30 and M_3 = 100, respectively. Also, we design an adaptive variant of RPCL (Xu et al., 1993), named Adaptive-RPCL, which is able to adjust the number of prototypes during training. Specifically, a counter is additionally maintained for each bottom-layer prototype to record the number of iterations since the last time that the prototype was updated by graph embeddings. When a counter reaches the threshold γ = 100, the corresponding bottom-layer prototype is removed, and an upper-layer prototype is removed if all the bottom-layer prototypes in its corresponding tree have been eliminated.

In the third segment of Tab. 5, we report the performance on the biological downstream task of the models pre-trained with different clustering algorithms. It can be observed that the two RPCL-based clustering methods marginally outperform K-means, and their performance is comparable with each other. These results illustrate that the proposed GraphLoG model is not very sensitive to the choice of clustering algorithm.
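A single RPCL update step (Xu et al., 1993) can be sketched as follows: the winning prototype is attracted toward the sample while its rival (the runner-up) is pushed away. The learning rates and the 2-D toy data are illustrative assumptions, not the paper's settings.

```python
# Minimal sketch of one Rival Penalized Competitive Learning (RPCL) step:
# attract the closest prototype to the sample, repel the second closest.
import numpy as np

def rpcl_step(prototypes, x, lr_win=0.05, lr_rival=0.002):
    """prototypes: (M, d) array, updated in place; x: (d,) sample."""
    dists = np.linalg.norm(prototypes - x, axis=1)
    win, rival = np.argsort(dists)[:2]
    prototypes[win] += lr_win * (x - prototypes[win])        # attract winner
    prototypes[rival] -= lr_rival * (x - prototypes[rival])  # repel rival
    return win, rival

protos = np.array([[0.0, 0.0], [1.0, 1.0], [4.0, 4.0]])
x = np.array([0.2, 0.1])
before = np.linalg.norm(protos[0] - x)
win, rival = rpcl_step(protos, x)
print(win, rival, np.linalg.norm(protos[0] - x) < before)   # 0 1 True
```

The rival-penalizing term is what lets RPCL drive redundant prototypes away from the data, so the number of effective prototypes can adapt; Adaptive-RPCL additionally prunes prototypes whose update counters go stale.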

E.4 ROBUSTNESS OF CLUSTERING ALGORITHM

In this experiment, we examine the robustness of the RPCL clustering algorithm in the proposed GraphLoG model. Specifically, we conduct the clustering-based hierarchical prototype initialization six times and obtain six different self-supervised pre-trained models based on the distinct initializations. Fig. 4 plots the biological downstream task performance of these six models. We can observe that they perform comparably with each other, which demonstrates that the GraphLoG model is fairly robust to different clustering outputs.

F SENSITIVITY OF ATTRIBUTE MASKING RATE

Setups. When varying the attribute masking rate to evaluate its sensitivity, the other hyperparameters are fixed to the values described in Sec. 6.1. Specifically, the hierarchical prototypes' depth L_p is set to 3, and the exponential decay rate β is set to 0.95.

Results. In Fig. 5, we plot the model's performance on the downstream tasks of the chemistry and biology domains under different masking rates. The highest test ROC-AUC is achieved when the attribute masking rate is around 30%, which means that, under this setting, the constructed correlated graphs benefit the proposed learning scheme most.
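One common way to maintain prototypes under an exponential decay rate β is an exponential moving average toward the mean of the currently assigned embeddings; the sketch below illustrates that generic rule with β = 0.95 as in the setups above. The exact maintenance rule in GraphLoG may differ in detail, so treat this as an assumption-laden illustration.

```python
# Hedged sketch of exponential-moving-average prototype maintenance: blend the
# old prototype with the mean of its assigned embeddings under decay rate beta.
import numpy as np

def ema_update(prototype, cluster_mean, beta=0.95):
    """Keep a beta fraction of the old prototype, mix in the new cluster mean."""
    return beta * prototype + (1.0 - beta) * cluster_mean

proto = np.zeros(4)
for _ in range(100):                 # repeatedly observe the same cluster mean
    proto = ema_update(proto, np.ones(4), beta=0.95)
print(np.round(proto, 3))            # converges toward the cluster mean
```

A large β makes the prototypes change slowly across iterations, which is consistent with the observation in Sec. 6 that performance is not sensitive to β.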

G TRAINING CURVES

In Fig. 6, we plot the training curves of four approaches, i.e. the random initialization baseline, context prediction (Hu et al., 2019), attribute masking (Hu et al., 2019) and the proposed GraphLoG method, on eight molecular property prediction tasks. From these line charts, we can observe that, through pre-training on a large-scale unlabeled dataset with GraphLoG, the GNN model is able to converge to a higher ROC-AUC on the training set compared with the other three methods.

