ITERATIVE GRAPH SELF-DISTILLATION

Abstract

How to discriminatively vectorize graphs is a fundamental challenge that has attracted increasing attention in recent years. Inspired by the recent success of unsupervised contrastive learning, we aim to learn graph-level representations in an unsupervised manner. Specifically, we propose a novel unsupervised graph learning paradigm called Iterative Graph Self-Distillation (IGSD), which iteratively performs teacher-student distillation with graph augmentations. Different from conventional knowledge distillation, IGSD constructs the teacher as an exponential moving average of the student model and distills knowledge from itself. The intuition behind IGSD is to train the student to predict the teacher network's representations of graph pairs under different augmented views. As a natural extension, we also apply IGSD to semi-supervised scenarios by jointly regularizing the network with both supervised and unsupervised contrastive losses. Finally, we show that fine-tuning the IGSD-trained models with self-training can further improve the graph representation power. Empirically, we achieve significant and consistent performance gains on various graph datasets in both unsupervised and semi-supervised settings, which validates the superiority of IGSD.
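To make the mechanism in the abstract concrete, the sketch below illustrates its two ingredients: a teacher whose weights are an exponential moving average (EMA) of the student's, and a student trained to predict the teacher's representation of a differently augmented view of the same graph. This is a minimal PyTorch illustration rather than the paper's exact architecture or objective; the linear encoder stand-in, the prediction head, the momentum value, and the negative-cosine-similarity loss are all assumptions made for the sake of a short runnable example.

```python
# Minimal sketch of an EMA teacher-student self-distillation step.
# Illustrative only: encoder, predictor, momentum, and loss are assumptions,
# not the exact components used in IGSD.
import copy
import torch
import torch.nn.functional as F

def ema_update(teacher, student, momentum=0.99):
    """Teacher parameters track an exponential moving average of the student."""
    with torch.no_grad():
        for t, s in zip(teacher.parameters(), student.parameters()):
            t.mul_(momentum).add_(s, alpha=1.0 - momentum)

def distillation_step(student, teacher, predictor, view1, view2):
    """Student predicts the teacher's representation of another augmented view."""
    z_s = predictor(student(view1))   # student branch + prediction head
    with torch.no_grad():
        z_t = teacher(view2)          # teacher branch, no gradients
    # Negative cosine similarity between the two embeddings (a common
    # self-distillation objective; the paper's loss may differ).
    return -F.cosine_similarity(z_s, z_t.detach(), dim=-1).mean()

# Usage: the teacher starts as a copy of the student and is never
# updated by gradient descent, only by the EMA rule above.
student = torch.nn.Linear(16, 8)      # stand-in for a GNN graph encoder
teacher = copy.deepcopy(student)
for p in teacher.parameters():
    p.requires_grad_(False)
predictor = torch.nn.Linear(8, 8)
view1, view2 = torch.randn(4, 16), torch.randn(4, 16)  # two augmented views
loss = distillation_step(student, teacher, predictor, view1, view2)
loss.backward()
ema_update(teacher, student)
```

Note the asymmetry of the design: only the student receives gradients, while the teacher evolves slowly through the EMA, which is what makes the procedure a form of self-distillation rather than conventional teacher-student distillation.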

1. INTRODUCTION

Graphs are ubiquitous representations encoding relational structures across various domains. Learning low-dimensional vector representations of graphs is critical in applications ranging from social science (Newman & Girvan, 2004) to bioinformatics (Duvenaud et al., 2015; Zhou et al., 2020). Many graph neural networks (GNNs) (Gilmer et al., 2017; Kipf & Welling, 2016; Xu et al., 2018) have been proposed to learn node and graph representations by aggregating information from each node's neighbors via non-linear transformation and aggregation functions. However, a key limitation of existing GNN architectures is that they often require a large amount of labeled data to be competitive, while annotating graphs such as drug-target interaction networks is challenging since it requires domain-specific expertise.

Therefore, unsupervised learning on graphs has long been studied, for example with graph kernels (Shervashidze et al., 2011) and matrix-factorization approaches (Belkin & Niyogi, 2002). Inspired by the recent success of unsupervised representation learning in domains such as images (Chen et al., 2020b; He et al., 2020) and text (Radford et al., 2018), most related work in the graph domain follows either the pipeline of unsupervised pretraining (followed by fine-tuning) or the InfoMax principle (Hjelm et al., 2018). The former often requires meticulous design of pretext tasks (Hu et al., 2019; You et al., 2020), while the latter dominates unsupervised graph representation learning: encoders are trained to maximize the mutual information (MI) between the representations of the global graph and local patches (such as subgraphs) (Veličković et al., 2018; Sun et al., 2019; Hassani & Khasahmadi, 2020). However, MI-based approaches usually need to sample subgraphs as local views to contrast with global graphs, and they typically require an additional discriminator for scoring local-global pairs and negative samples, which is computationally prohibitive (Tschannen et al., 2019). Besides, their performance is very sensitive to the choice of encoders and MI estimators (Tschannen et al., 2019). Moreover, MI-based approaches cannot be easily extended to the semi-supervised setting, since local subgraphs lack labels that can be utilized for training. We therefore seek an approach that learns representations of entire graphs by contrasting whole graphs directly, without the need for MI estimation, discriminators, or subgraph sampling.

Motivated by recent progress on contrastive learning, we propose Iterative Graph Self-Distillation (IGSD), a teacher-student framework that learns graph representations by contrasting graph instances directly. The high-level idea of IGSD is based on graph contrastive learning, where we pull sim-

