GRAPH CONTRASTIVE LEARNING WITH MODEL PERTURBATION

Abstract

Graph contrastive learning (GCL) has achieved great success in pre-training graph neural networks (GNNs) without ground-truth labels. The performance of GCL mainly relies on designing high-quality contrastive views via data augmentation. However, finding desirable augmentations is difficult and requires cumbersome effort due to the diverse modalities of graph data. In this work, we study model perturbation to perform efficient contrastive learning on graphs without data augmentation. Instead of searching for the optimal combination of perturbations on nodes, edges, or attributes, we propose to perturb the model architecture itself (i.e., the GNN). However, it is non-trivial to perturb GNN models effectively without a performance drop relative to their data augmentation counterparts. This is because data augmentation 1) makes complex perturbations in the graph space, so its effect is hard to mimic in the model parameter space with a fixed noise distribution, and 2) disturbs even the same nodes differently between two views owing to its randomness. Motivated by this, we propose a novel model perturbation framework, PerturbGCL, to pre-train GNN encoders. We focus on perturbing two key operations in a GNN: message propagation and feature transformation. Specifically, we propose weightPrune to create a dynamic perturbed model that contrasts with the target one by pruning its transformation weights according to their magnitudes. Contrasting the two models leads to adaptive mining of the perturbation distribution from the data. Furthermore, we present randMP to disturb the number of message propagation steps in the two contrastive models. By randomly choosing the propagation steps during training, it increases the local variance of nodes between the contrastive views. Despite their simplicity, coupling the two strategies enables effective contrastive learning on graphs with model perturbation.
We conduct extensive experiments on 15 benchmarks. The results demonstrate the superiority of PerturbGCL: it achieves competitive results against strong baselines across both node-level and graph-level tasks while requiring shorter computation time. The code is available at https://anonymous.4open.science/r/PerturbGCL-F17D.

1. INTRODUCTION

Graph neural networks (GNNs) (Kipf & Welling, 2016a; Hamilton et al., 2017; Gilmer et al., 2017) have become the de facto standard to model graph-structured data, such as social networks (Li & Goldwasser, 2019), molecules (Duvenaud et al., 2015), and knowledge graphs (Arora, 2020). Nevertheless, GNNs require task-specific labels to supervise the training, which is impractical in many scenarios where annotating graphs is challenging and expensive (Sun et al., 2019). Therefore, increasing efforts (Hou et al., 2022; Veličković et al., 2018; Hassani & Khasahmadi, 2020; Thakoor et al., 2022) have been made to train GNNs in an unsupervised fashion, so that the pre-trained model or learned representations can be directly applied to different downstream tasks. Recently, graph contrastive learning (GCL) has become the state-of-the-art approach for both graph-level (You et al., 2020; 2021; Suresh et al., 2021; Xu et al., 2021) and node-level (Qiu et al., 2020; Zhu et al., 2021b; Bielak et al., 2021; Thakoor et al., 2022) tasks. The general idea of GCL is to create two views of the original input using data augmentation (Jin et al., 2020), and then encode them with two GNN branches that share the same architecture and weights (You et al., 2020). The model is then optimized to maximize the mutual information between the two encoded representations according to contrastive objectives, such as InfoNCE (Oord et al., 2018) or Barlow Twins (Zbontar et al., 2021). As such, the performance of GCL mainly relies on designing high-quality contrastive views (Zhang et al., 2021). Recently, intensive studies (You et al., 2020; Jin et al., 2020; Han et al., 2022) have been devoted to exploring effective augmentation strategies for graph data. Despite their success, finding desirable augmentations requires cumbersome effort, since the optimal augmentations are domain-specific and vary from graph to graph (You et al., 2020; Yin et al., 2022).
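To make the contrastive objective mentioned above concrete, the following is a minimal NumPy sketch of an InfoNCE-style loss between two views' node representations. This is an illustrative, simplified form (single direction, fixed temperature), not any particular paper's implementation; the function name `info_nce` and the temperature value are assumptions.

```python
import numpy as np

def info_nce(z1, z2, tau=0.5):
    """InfoNCE loss between two views; row i of z1 and z2 is a positive pair.

    z1, z2: (N, D) arrays of node representations from the two branches.
    """
    # L2-normalise so dot products become cosine similarities
    z1 = z1 / np.linalg.norm(z1, axis=1, keepdims=True)
    z2 = z2 / np.linalg.norm(z2, axis=1, keepdims=True)
    sim = z1 @ z2.T / tau                        # (N, N) similarity matrix
    sim = sim - sim.max(axis=1, keepdims=True)   # subtract row max for stability
    exp = np.exp(sim)
    # positives sit on the diagonal; every other column acts as a negative
    loss = -np.log(np.diag(exp) / exp.sum(axis=1))
    return float(loss.mean())

rng = np.random.default_rng(0)
z = rng.normal(size=(8, 16))
# aligned views score strictly lower than anti-aligned ones
print(info_nce(z, z) < info_nce(z, -z))  # True
```

Maximizing mutual information here amounts to pulling each node's two views together (diagonal) while pushing apart representations of different nodes (off-diagonal).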
To tackle this problem, SimGRACE (Xia et al., 2022) introduced the idea of model perturbation. Instead of searching for the optimal combination of perturbations on nodes, edges, or attributes in the graph space, SimGRACE conducts perturbation in a unified parameter space by adding Gaussian noise to model weights. However, we observe that SimGRACE may lead to sub-optimal representations compared with its data augmentation counterparts for two reasons. First, data augmentation in the graph space is rather complicated and goes beyond a Gaussian distribution; as a result, weight perturbation based on Gaussian noise cannot achieve effects similar to data perturbation on representation learning (as illustrated in Section 2.2). Second, weight perturbation does not consider local variances among different nodes in a graph, since the perturbation is data-agnostic. Therefore, it remains an important yet unsolved challenge to develop an effective model perturbation framework for GCL that produces effective representations on both node- and graph-level tasks in a more efficient manner. To tackle these challenges, in this work we propose a novel framework, PerturbGCL, to train GNN encoders via model perturbation. Different from SimGRACE (Xia et al., 2022), which focuses only on weight perturbation, we go one step further and disturb the message passing (MP) of GNNs, since this provides local disturbances between contrastive views. Specifically, we present weightPrune to construct a perturbed model by pruning the transformation weights of the target one. Unlike the Gaussian noise in SimGRACE (Xia et al., 2022), the pruned model co-evolves with the target GNN, leading to an adaptive, data-driven mining of the noise perturbation from the data. Furthermore, we propose randMP to offer local disturbances on nodes between contrastive views.
It works by applying k message propagation steps in each contrastive model, where k is randomly sampled on-the-fly. Informally, performing MP k times can be thought of as convolving over the anchor node's k-hop neighborhood (Gao et al., 2018). On this basis, we can learn diverged but correlated representations from the two contrastive models with different k values, owing to the homophily theory (Altenburger & Ugander, 2018). Coupling the two strategies yields a principled model perturbation solution tailored for GCL, whose effectiveness and efficiency have been empirically verified through our extensive experiments. We summarize our main contributions as follows:
• We introduce Perturbed Graph Contrastive Learning (PerturbGCL), a principled contrastive learning method on graphs that works by perturbing GNN architectures. PerturbGCL is flexible and easy to implement. To the best of our knowledge, PerturbGCL is the first model perturbation work that achieves promising results on both node- and graph-level tasks.
• PerturbGCL perturbs GNN architectures from both the message passing and the model weight perspectives via two effective perturbation strategies: randMP and weightPrune. By applying the two strategies jointly in the contrastive models, PerturbGCL learns to mimic the effect of data augmentation from the model perturbation side.
• Extensive experiments across 15 benchmark datasets demonstrate the superiority of our proposal. Specifically, PerturbGCL outperforms state-of-the-art baselines without using data augmentation across two evaluation scenarios. Moreover, PerturbGCL is easy to optimize and generally runs faster than strong GCL baselines.
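The two strategies can be sketched in a few lines of NumPy. This is an illustrative sketch only, not the paper's implementation: the function names `weight_prune` and `rand_mp`, the GCN-style symmetric normalization, the pruning ratio, and the range of k are all assumptions for exposition.

```python
import numpy as np

def weight_prune(W, ratio=0.3):
    """Zero out the `ratio` fraction of entries of W with smallest magnitude.

    Returns a pruned copy; the target weights W stay untouched, so the
    perturbed branch remains tied to (and co-evolves with) the target model.
    """
    flat = np.abs(W).ravel()
    k = int(ratio * flat.size)
    if k == 0:
        return W.copy()
    thresh = np.partition(flat, k - 1)[k - 1]   # k-th smallest magnitude
    return np.where(np.abs(W) <= thresh, 0.0, W)

def rand_mp(A, X, W, max_k=4, rng=None):
    """One forward pass with a randomly sampled number of propagation steps.

    A: (N, N) adjacency, X: (N, F) features, W: (F, D) transformation weights.
    """
    if rng is None:
        rng = np.random.default_rng()
    # symmetrically normalised adjacency with self-loops: D^-1/2 (A+I) D^-1/2
    A_hat = A + np.eye(A.shape[0])
    d = A_hat.sum(axis=1)
    A_hat = A_hat / np.sqrt(np.outer(d, d))
    k = rng.integers(1, max_k + 1)   # propagation depth sampled on-the-fly
    H = X @ W                        # feature transformation
    for _ in range(k):               # k rounds of message propagation
        H = A_hat @ H                # aggregate from (k-hop) neighbourhood
    return H

# Two contrastive branches: the target weights vs. their pruned copy,
# each with an independently sampled propagation depth.
rng = np.random.default_rng(0)
A = (rng.random((6, 6)) < 0.4).astype(float)
A = np.triu(A, 1); A = A + A.T                 # symmetric, no self-loops
X = rng.normal(size=(6, 5))
W = rng.normal(size=(5, 3))
h_target = rand_mp(A, X, W, rng=rng)
h_perturb = rand_mp(A, X, weight_prune(W, 0.3), rng=rng)
```

The two outputs `h_target` and `h_perturb` would then be fed to a contrastive objective; because pruning depends on the current magnitudes of W, the perturbation distribution adapts as training proceeds.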

2.1. NOTATIONS AND PRELIMINARIES

Notations. Let G = (V, E, X) be an undirected graph, where V is the set of nodes and E is the set of edges. X ∈ R^{|V|×F} is the node feature matrix, where the i-th row of X denotes the F-dimensional feature vector of the i-th node in V. We use f_w to denote the mapping function that encodes each node v ∈ G into a D-dimensional representation h_v ∈ R^D.
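The notation can be grounded with a toy example. The one-layer GCN-style choice of f_w below is purely illustrative (the paper's encoder is a generic GNN), and the dimensions are arbitrary.

```python
import numpy as np

# A toy undirected graph G = (V, E, X) with |V| = 4 nodes and F = 3 features.
edges = [(0, 1), (1, 2), (2, 3)]              # the edge set E
A = np.zeros((4, 4))
for u, v in edges:
    A[u, v] = A[v, u] = 1.0                   # undirected: symmetric adjacency
X = np.arange(12, dtype=float).reshape(4, 3)  # X in R^{|V| x F}; row i = node i

# f_w: a one-layer GCN-style encoder mapping each node v to h_v in R^D (D = 2).
W = np.full((3, 2), 0.1)                      # the learnable weights w
H = (A + np.eye(4)) @ X @ W                   # propagate (with self-loops) and transform
print(H.shape)                                # (4, 2): one D-dim vector per node
```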

