GRAPH CONTRASTIVE LEARNING WITH MODEL PERTURBATION

Abstract

Graph contrastive learning (GCL) has achieved great success in pre-training graph neural networks (GNNs) without ground-truth labels. The performance of GCL mainly relies on designing high-quality contrastive views via data augmentation. However, finding desirable augmentations is difficult and requires cumbersome effort due to the diverse modalities in graph data. In this work, we study model perturbation to perform efficient contrastive learning on graphs without using data augmentation. Instead of searching for the optimal combination among perturbing nodes, edges, or attributes, we propose to conduct perturbation on the model architectures (i.e., GNNs). However, it is non-trivial to achieve effective perturbations on GNN models without a performance drop compared with their data augmentation counterparts. This is because data augmentation 1) makes complex perturbations in the graph space, so it is hard to mimic its effect in the model parameter space with a fixed noise distribution, and 2) produces different disturbances even on the same nodes between two views owing to the randomness. Motivated by this, we propose a novel model perturbation framework, PerturbGCL, to pre-train GNN encoders. We focus on perturbing two key operations in a GNN: message propagation and transformation. Specifically, we propose weightPrune to create a dynamic perturbed model to contrast with the target one by pruning its transformation weights according to their magnitudes. Contrasting the two models leads to adaptive mining of the perturbation distribution from the data. Furthermore, we present randMP to disturb the steps of message propagation in the two contrastive models. By randomly choosing the propagation steps during training, it helps to increase local variances of nodes between the contrastive views. Despite their simplicity, coupling the two strategies enables us to perform effective contrastive learning on graphs with model perturbation.
We conduct extensive experiments on 15 benchmarks. The results demonstrate the superiority of PerturbGCL: it achieves competitive results against strong baselines across both node-level and graph-level tasks, while requiring shorter computation time. The code is available at https://anonymous.4open.science/r/PerturbGCL-F17D.

1. INTRODUCTION

Graph neural networks (GNNs) (Kipf & Welling, 2016a; Hamilton et al., 2017; Gilmer et al., 2017) have become the de facto standard for modeling graph-structured data, such as social networks (Li & Goldwasser, 2019), molecules (Duvenaud et al., 2015), and knowledge graphs (Arora, 2020). Nevertheless, GNNs require task-specific labels to supervise the training, which is impractical in many scenarios where annotating graphs is challenging and expensive (Sun et al., 2019). Therefore, increasing efforts (Hou et al., 2022; Veličković et al., 2018; Hassani & Khasahmadi, 2020; Thakoor et al., 2022) have been made to train GNNs in an unsupervised fashion, so that the pre-trained model or learned representations can be directly applied to different downstream tasks. Recently, graph contrastive learning (GCL) has become the state-of-the-art approach for both graph-level (You et al., 2020; 2021; Suresh et al., 2021; Xu et al., 2021) and node-level (Qiu et al., 2020; Zhu et al., 2021b; Bielak et al., 2021; Thakoor et al., 2022) tasks. The general idea of GCL is to create two views of the original input using data augmentation (Jin et al., 2020), and then encode them with two GNN branches that share the same architecture and weights (You et al., 2020). Then, the model is optimized to maximize the mutual information between the two encoded representations according to contrastive objectives, such as InfoNCE (Oord et al., 2018) or Barlow Twins (Zbontar et al., 2021). As such, the performance of GCL mainly relies on designing high-quality contrastive views (Zhang et al., 2021). Recently, intensive studies (You et al., 2020; Jin et al., 2020; Han et al., 2022) have been devoted to exploring effective augmentation strategies for graph data. Despite their success, finding desirable augmentations requires cumbersome effort, since the optimal augmentations are domain-specific and vary from graph to graph (You et al., 2020; Yin et al., 2022).
To tackle this problem, SimGRACE (Xia et al., 2022) introduced the idea of model perturbation. Instead of searching for the optimal combination among perturbing nodes, edges, or attributes in the graph space, SimGRACE conducts perturbation in a unified parameter space by adding Gaussian noise to model weights. However, we observe that SimGRACE may lead to sub-optimal representations compared with its data augmentation counterparts, for two reasons. First, data augmentation in the graph space is rather complicated and goes beyond a Gaussian distribution. As a result, weight perturbation based on Gaussian noise cannot achieve effects on representation learning similar to those of data perturbation (as illustrated in Section 2.2). Second, the weight perturbation does not consider local variances among different nodes in a graph, since the perturbation is data-agnostic. Therefore, it remains an important yet unsolved challenge to develop an effective model perturbation framework for GCL that produces effective representations on both node and graph learning tasks in a more efficient manner.

To tackle these challenges, we propose a novel framework, PerturbGCL, to train GNN encoders via model perturbation. Different from SimGRACE (Xia et al., 2022), which only focuses on weight perturbation, we go one step further and disturb the message passing (MP) of GNNs, since this provides local disturbances between contrastive views. Specifically, we present weightPrune to construct a perturbed model by pruning the transformation weights of the target one. Unlike the Gaussian noise in SimGRACE (Xia et al., 2022), the pruned model co-evolves with the target GNN, leading to an adaptive, data-driven mining of the noise perturbation. Furthermore, we propose randMP to offer local disturbances on nodes among contrastive views. It works by conducting $k$ message propagation steps in each contrastive model, where $k$ is randomly sampled on the fly. Informally, performing MP $k$ times can be thought of as conducting convolution on the anchor node's $k$-hop neighbors (Gao et al., 2018). On this basis, we can learn diverged but correlated representations from the two contrastive models with different $k$ values, owing to the homophily theory (Altenburger & Ugander, 2018). Coupling the two strategies yields a principled model perturbation solution tailored for GCL, whose effectiveness and efficiency are empirically verified through our extensive experiments. We summarize our main contributions as follows:

• We introduce Perturbed Graph Contrastive Learning (PerturbGCL), a principled contrastive learning method on graphs that works by perturbing GNN architectures. PerturbGCL is flexible and easy to implement. To the best of our knowledge, PerturbGCL is the first model perturbation work that achieves promising results on both node and graph learning tasks.
• PerturbGCL perturbs GNN architectures from both the message passing and model weight perspectives via two effective perturbation strategies: randMP and weightPrune. By applying the two strategies jointly in contrastive models, PerturbGCL learns to mimic the effect of data augmentation through model perturbation.
• Extensive experiments across 15 benchmark datasets demonstrate the superiority of our proposal. Specifically, PerturbGCL can outperform state-of-the-art baselines without using data augmentation across two evaluation scenarios. Moreover, PerturbGCL is easy to optimize and generally runs faster than strong GCL baselines.

2.1. NOTATIONS AND PRELIMINARIES

Notations. Let $G = (\mathcal{V}, \mathcal{E}, X)$ be an undirected graph, where $\mathcal{V}$ is the set of nodes and $\mathcal{E}$ is the set of edges. $X \in \mathbb{R}^{|\mathcal{V}| \times F}$ is the node feature matrix, where the $i$-th row of $X$ denotes the $F$-dimensional feature vector of the $i$-th node in $\mathcal{V}$. We use $f_w$ to denote the mapping function that encodes each node $v \in \mathcal{V}$ into a $D$-dimensional representation $h_v \in \mathbb{R}^{D}$.

Graph Neural Networks. To learn representations on graph data, we use graph neural networks (GNNs) (Kipf & Welling, 2016a; Hamilton et al., 2017; Gilmer et al., 2017) as the encoder $f_w$. Without loss of generality, we present a GNN as a message passing network:

$$h_v^{(l)} = \sigma\big(a_v^{(l-1)} W^{l}\big), \quad a_v^{(l-1)} = g^{(l-1)}\big(h_v^{(l-1)}, \{h_u^{(l-1)} : u \in \mathcal{N}_v\}\big), \tag{1}$$

where $h_v^{(l)} \in \mathbb{R}^{D}$ is the intermediate representation of node $v$ at the $l$-th layer, and $\mathcal{N}_v$ denotes the direct neighbors of node $v$. We use $g$ to denote the message propagation (MP) function, which updates a node's representation by integrating those of its neighbors with a transformation function (Wu et al., 2019). $W^{l} \in \mathbb{R}^{D \times D}$ is the transformation weight matrix and $\sigma$ is the activation function, such as ReLU. We often use the final layer's output as the node-level representation, i.e., $h_v = h_v^{L}$, where $L$ is the number of layers in the GNN. To get the graph-level representation $h_G \in \mathbb{R}^{D}$ for graph $G$, we further aggregate all node-level representations in the graph via a readout function:

$$h_G = \mathrm{READOUT}\big(\{h_v^{L} : v \in \mathcal{V}\}\big), \quad z_G = h(h_G) = \mathrm{MLP}(h_G). \tag{2}$$

Here $\mathrm{READOUT}(\cdot)$ can be the simple average pooling function or a more sophisticated one (Ying et al., 2018; Gao & Ji, 2019). $h(\cdot)$ is the projection head, and $z_G \in \mathbb{R}^{D}$ denotes the embedding used for loss estimation. In the development of our method, we follow existing graph contrastive learning practices and consider three state-of-the-art GNNs: GCN (Kipf & Welling, 2016a), GIN (Xu et al., 2018), and ResGCN (Chen et al., 2019).

Graph Contrastive Learning.
Graph contrastive learning (GCL) (You et al., 2020) has become a state-of-the-art approach for pre-training GNNs without ground-truth labels (Liu et al., 2021b; a). Unlike reconstruction-based methods (Perozzi et al., 2014; Kipf & Welling, 2016b), GCL is built upon a contrastive objective between the so-called positive pairs and negative pairs generated from the original data. Formally, given an anchor node $v$, let $(z_v, z_v^{+})$ denote the representations of a positive pair and $(z_v, z_v^{-})$ those of a negative pair. The contrastive loss can be defined as:

$$\mathcal{L}_{\mathrm{CL}} = \frac{1}{|\mathcal{V}|} \sum_{v \in \mathcal{V}} -\log \frac{\exp(\mathrm{sim}(z_v, z_v^{+})/\tau)}{\exp(\mathrm{sim}(z_v, z_v^{+})/\tau) + \sum_{u \in \mathcal{V}, u \neq v} \exp(\mathrm{sim}(z_v, z_u^{-})/\tau)}, \tag{3}$$

where $\tau$ is the temperature parameter and $\mathrm{sim}(\cdot, \cdot)$ denotes the similarity function. By minimizing Eq. 3, the GNN encoder is trained to increase the similarity of the positive pairs while enlarging the distance between negative pairs in the hidden space. It is also worth noting that some GCL variants (Thakoor et al., 2022; Bielak et al., 2022) do not rely on negative samples. The key question in contrastive learning is how to generate effective positive (or negative) pairs. To this end, graph augmentation has been adopted as the golden rule in GCL (You et al., 2020; Jin et al., 2020). Typical graph augmentation techniques include edge perturbation, node masking, and attribute masking.
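As a concrete reading of Eq. 3, the following is a minimal NumPy sketch, not the authors' implementation. It assumes cosine similarity for sim(·, ·), takes the negatives of node v to be the second-view embeddings of all other nodes, and uses an illustrative temperature of 0.5; the function name `info_nce` is hypothetical.

```python
import numpy as np

def info_nce(z_anchor, z_pos, tau=0.5):
    """Contrastive loss in the form of Eq. 3 (assumptions: cosine similarity,
    negatives of node v are the second-view embeddings of the other nodes)."""
    za = z_anchor / np.linalg.norm(z_anchor, axis=1, keepdims=True)
    zp = z_pos / np.linalg.norm(z_pos, axis=1, keepdims=True)
    # logits[v, u] = sim(z_v, z_u) / tau; the diagonal holds the positive pairs.
    logits = za @ zp.T / tau
    # Subtract the row-wise maximum for numerical stability (does not change the loss).
    logits = logits - logits.max(axis=1, keepdims=True)
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # Average of -log(positive term / (positive term + negative terms)) over nodes.
    return float(-np.mean(np.diag(log_prob)))
```

With matched views the diagonal dominates each row and the loss is small; permuting the rows of the second view breaks the positive pairs and increases the loss.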

2.2. THE PROPOSED PERTURBGCL FRAMEWORK

Motivation: How to perform model perturbation on GCL? Consider a graph $G = (A, X)$, where $A$ is the adjacency matrix. For illustration purposes, we consider the mapping function $f_w(A, X; W)$ with one simple GCN (Wu et al., 2019) layer without an activation function. Then, the hidden representation is computed by

$$f_w(A, X; W) = g(A, X)W,$$

where $W$ is the weight matrix and $g(\cdot, \cdot)$ is the message propagation operation defined in Eq. 1:

$$g(A, X) = D^{-\frac{1}{2}}(A + I_G)D^{-\frac{1}{2}}X_v,$$

where $D_{ii} = \sum_j (A + I_G)_{ij}$ is the degree matrix and $I_G$ is the identity matrix. An intuitive solution is to add Gaussian noise to the model weights and derive two contrastive views $f'_w(A, X; W)$ and $f''_w(A, X; W)$, as done in SimGRACE (Xia et al., 2022):

$$f'_w(A, X; W) = g(A, X)W, \quad f''_w(A, X; p(W)) = g(A, X)p(W),$$

where $p(W) = W + \eta\Delta_W$ is the perturbation function on the model weights, $\Delta_W \sim \mathcal{N}(0, \delta^2)$ is the noise term sampled from a Gaussian distribution with zero mean and variance $\delta^2$, and $\eta$ is a hyperparameter that scales the magnitude of the perturbation. Since the learning task is to minimize the distance between the two views, SimGRACE trains the model to be robust to Gaussian noise in the weights.

Figure 1: Black circles indicate the results of baselines; orange circles represent the results of SimGRACE. SimGRACE performs worse than standard GCL methods in modeling common perturbations achieved by data augmentation. Detailed experimental setups are listed in Appendix B.

However, we found that the contrastive views created by SimGRACE can be sub-optimal compared with their data augmentation counterparts (see Figure 1). To measure the quality of representations learned by GCL models, we consider two popular metrics, alignment and uniformity (Wang & Isola, 2020), expressed as:

$$\mathcal{L}_{\mathrm{align}}(f_w; \alpha) \triangleq \mathbb{E}_{(x,y) \sim P_{\mathrm{pos}}}\big[\|h_x - h_y\|_2^{\alpha}\big], \quad \alpha > 0,$$

$$\mathcal{L}_{\mathrm{uniform}}(f_w; t) \triangleq \log \mathbb{E}_{(x,y) \overset{\mathrm{i.i.d.}}{\sim} P_{\mathrm{data}}}\big[e^{-t\|h_x - h_y\|_2^2}\big], \quad t > 0.$$

$P_{\mathrm{pos}}$ is the distribution of positive pairs, i.e., augmentations of the same sample, and $P_{\mathrm{data}}$ is the data distribution. $\mathcal{L}_{\mathrm{align}}$ measures whether positive samples stay close in the hidden space. $\mathcal{L}_{\mathrm{uniform}}$ measures whether random samples are scattered on the hypersphere of the hidden space. In our experiments, we set $\alpha = t = 2$ following (Xia et al., 2022). Figure 1 reports the test results of SimGRACE and augmentation-based GCL methods (Zhang et al., 2021; Thakoor et al., 2022; Zhu et al., 2020) on the same perturbed graphs created by random edge and attribute masking. Black circles and orange circles represent the performance of the baselines and SimGRACE, respectively. As can be observed, SimGRACE performs worse than the three baselines across different datasets by a large margin. These results shed light on the bottleneck of SimGRACE in capturing common perturbations achieved by data augmentation: SimGRACE is limited by the Gaussian noise and cannot handle perturbations in the graph space. We thus ask: Can we design more advanced model perturbation strategies that achieve functions similar to those of data augmentation in GCL?

What is behind data augmentation in GCL? To answer this question, we start by analyzing the working mechanism behind the standard GCL framework. Let $T(\cdot)$ and $q(\cdot)$ denote augmentation functions on the topology structure and node attributes, respectively. The two contrastive representations $f'_w(A, X; W)$ and $f''_w(A, X; W)$ of standard GCL (You et al., 2020) are defined as:

$$f'_w(A, X; W) = g(T'(A), q'(X))W, \quad f''_w(A, X; W) = g(T''(A), q''(X))W.$$

That is, the standard GCL framework learns to be robust to small disturbances on the graph (created by $(T'(\cdot), q'(\cdot))$ and $(T''(\cdot), q''(\cdot))$). We can easily observe two major properties of data augmentation as follows. ❶ It can disturb different nodes in a graph (or different graphs in a graph set) differently.
This is because $T(\cdot)$ and $q(\cdot)$ are random functions, such as edge masking, so they can have distinct effects even for the same input. ❷ The perturbation distribution incurred by data augmentation (e.g., $(T'(\cdot), q'(\cdot))$) is complicated and often goes beyond Gaussian noise. As analyzed in Figure 1, although SimGRACE is trained to be robust to Gaussian noise to some extent, it cannot handle perturbations in the graph space well. These properties motivate us to design tailored model perturbation strategies from these two aspects.

Our PerturbGCL proposal. To bridge the gap, we propose a novel model perturbation framework called PerturbGCL. Figure 2 provides an overview of our framework. In this work, we build PerturbGCL upon the GraphCL pipeline (You et al., 2020) and follow its major components, such as two GNN branches and a non-linear projection head. The main difference is that GraphCL augments the input graph to get two views and then processes them with two branches that share the same GNN architecture and weights, while PerturbGCL processes the original input graph with two asymmetric GNN branches. One branch uses the original GNN model $f_w$, while the other branch disturbs the message propagation process and model weights of $f_w$. We conduct perturbation on two major GNN operations: message passing (MP) and transformation. On this basis, we introduce two simple yet effective perturbation strategies, randMP and weightPrune, for effective model augmentation of GNN architectures.

Figure 2: The overview of the proposed PerturbGCL framework. The original graph is fed into two asymmetric GNN branches: one is the target encoder $f_w$ to be trained, and the other is the perturbed version $f'_w$ that is pruned from the former online. The two branches share weights for their non-pruned parameters. Each branch has independent message propagation (MP) operations perturbed by a random number, i.e., $k$, to disturb nodes locally. Since the pruned branch is always obtained and updated from the latest target model, the two branches co-evolve during training.

Strategy #1: weightPrune. Recently, model pruning has attracted increasing attention for model compression thanks to the popularity of the lottery ticket hypothesis (Frankle & Carbin, 2018). Chen et al. (2021) found that a GNN can be pruned to a sparse sub-network without a significant performance drop using rewinding techniques. These observations indicate the potential of pruning as a practical perturbation approach for a GNN's weights. Inspired by this, we propose weightPrune, which creates the perturbed branch by pruning the model parameters of the target encoder. Specifically, let $W$ denote the weights of the target branch and $m_w$ be the mask of the pruned branch, which has the same size as $W$. At each iteration, we prune the target branch with a pre-defined prune ratio $s$ according to the magnitude of the weight values, i.e., masking out the weights whose magnitudes rank below $s$.
By changing $s$, we can control the degree of distortion of the target branch to a certain extent. After that, the target and perturbed branches use $W$ and $W \odot m_w$, respectively, as model weights to generate representations, which are then fed into the contrastive loss. Since the mask indicator $m_w$ is continuously updated from the latest target model, the two branches co-evolve during training.

Strategy #2: randMP. Message passing is another critical component of the GNN architecture, since it offers the flexibility to aggregate features from multi-hop neighbors. Performing MP $k$ times over the graph $G$ is equivalent to updating node $v$'s representation based on its $k$-hop local subgraph. In light of this, $k$ can naturally be regarded as a perturbation factor, where different $k$ values generate diverse but semantically correlated representations for the same node. To implement randMP, we randomly sample two $k$ values at each iteration: one for the target branch (i.e., $k'$) and the other for the perturbed branch (i.e., $k''$). Formally, if we assume $g(A, X) = D^{-\frac{1}{2}}(A + I_G)D^{-\frac{1}{2}}X_v$, performing MP $k$ times gives $g(A, X)^k = \big(D^{-\frac{1}{2}}(A + I_G)D^{-\frac{1}{2}}\big)^k X_v$. In experiments, we sample $k$ during training because doing so may encourage the GNN encoder to learn generalizable representations that are invariant to different combinations of local enclosing graphs.

To sum up, different from existing methods (Xia et al., 2022; Thakoor et al., 2022) that disturb model weights with Gaussian noise, we suggest a principled approach to effectively perturb GNN architectures from their message propagation and feature transformation perspectives, via two simple perturbation techniques: randMP and weightPrune. randMP aims to map the same input graph into two semantically similar representations by conducting a random number of message passing steps. Meanwhile, weightPrune aims to increase the diversity of the two representations via model pruning. Combining the two strategies enables us to find the sweet spot between the two view representations (Tian et al., 2020), i.e., correlated but sufficiently diverged. The complete optimization procedure of our model is outlined in Algorithms 1 and 2 in the Appendix.
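The two strategies can be illustrated with a minimal NumPy sketch for the single linear GCN layer used in the motivation above. It is not the authors' code (see Algorithms 1 and 2 in the Appendix for their procedure). The helper names are hypothetical, and two choices are assumptions made here for illustration: the steps k' and k'' are drawn uniformly from {1, ..., K}, and the mask zeroes exactly the fraction s of weights with the smallest magnitudes.

```python
import numpy as np

def normalized_adjacency(A):
    """Propagation operator D^{-1/2} (A + I) D^{-1/2}, with D_ii = sum_j (A + I)_ij."""
    A_hat = A + np.eye(A.shape[0])
    d_inv_sqrt = 1.0 / np.sqrt(A_hat.sum(axis=1))
    return A_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]

def weight_prune_mask(W, s):
    """Binary mask m_w that zeroes the fraction s of entries of W with the smallest magnitude."""
    n_prune = int(s * W.size)
    mask = np.ones(W.size, dtype=float)
    if n_prune > 0:
        smallest = np.argsort(np.abs(W).ravel(), kind="stable")[:n_prune]
        mask[smallest] = 0.0
    return mask.reshape(W.shape)

def rand_mp(P, X, k):
    """Apply the propagation operator k times, i.e., compute P^k X."""
    H = X
    for _ in range(k):
        H = P @ H
    return H

def perturbgcl_views(A, X, W, s, K, rng):
    """One forward pass of the target branch and the perturbed branch."""
    P = normalized_adjacency(A)
    # Assumption: k' and k'' are sampled uniformly from {1, ..., K} at every step.
    k1, k2 = rng.integers(1, K + 1, size=2)
    # The mask is recomputed from the latest W, so the two branches co-evolve.
    m_w = weight_prune_mask(W, s)
    h_target = rand_mp(P, X, int(k1)) @ W
    h_perturbed = rand_mp(P, X, int(k2)) @ (W * m_w)
    return h_target, h_perturbed

# Toy usage on a 3-node path graph with random features and weights.
rng = np.random.default_rng(0)
A = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], dtype=float)
X = rng.normal(size=(3, 4))
W = rng.normal(size=(4, 2))
h_target, h_perturbed = perturbgcl_views(A, X, W, s=0.5, K=3, rng=rng)
```

The two outputs would then be passed through the projection head and the contrastive loss; only W is trainable, since the mask is derived from it rather than learned.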

2.3. MORE DISCUSSIONS ON PERTURBGCL

PerturbGCL is complementary to other GCL efforts. We focus on training GCL using only model perturbation. It can be easily combined with existing contrastive learning advances, such as mature graph augmentation techniques (You et al., 2020) and negative-sample-free contrastive objectives (Bielak et al., 2022; Thakoor et al., 2022), as we will show in Section 3.4.

Computational complexity analysis. In addition to saving the considerable time needed to search for optimal data augmentation strategies, PerturbGCL is efficient per update step, as we analyze here. Consider a graph $G = (\mathcal{V}, \mathcal{E})$ and the GNN encoder $f_w$. The time complexity of the most popular GNN architectures (Kipf & Welling, 2016a; Veličković et al., 2017; Gilmer et al., 2017) is $O(|\mathcal{E}| + |\mathcal{V}|)$, where $O(|\mathcal{E}|)$ and $O(|\mathcal{V}|)$ are mainly caused by the message propagation and feature transformation operations, respectively. PerturbGCL performs two encoder computations per update step (one for each GNN branch) plus a node-level projection head. Assuming that the backward pass is approximately as costly as the forward pass, and ignoring the negligible cost of weight pruning, the total time complexity per update step for PerturbGCL is $4C_{\mathrm{encoder}}(K|\mathcal{E}| + |\mathcal{V}|s) + 2C_{\mathrm{head}}(|\mathcal{V}|) + C_{\mathrm{loss}}$, where the $C_{\cdot}$ are constants depending on the architecture of the different components, $K$ is the maximum number of MP operations considered (e.g., $K = 3$), and $s$ is the pruning ratio. It is worth noting that although our model takes at most $K$ MP operations in the forward pass, weight pruning (e.g., $s = 70\%$) makes the computation costs of feature transformation and backpropagation significantly lower than in standard GCL methods. Therefore, the total running cost of PerturbGCL can be further reduced in practice. We empirically analyze the efficiency of our model in Section 3.5.
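To make the per-step cost expression concrete, the short sketch below simply evaluates it as stated. The function name, the unit constants, and the toy graph size are all hypothetical, and the head term is read here as linear in the number of nodes, which is an assumption.

```python
def perturbgcl_step_cost(num_edges, num_nodes, K, s,
                         c_encoder=1.0, c_head=1.0, c_loss=0.0):
    """Evaluate 4*C_encoder*(K|E| + |V|s) + 2*C_head*|V| + C_loss.

    The constants are placeholders (assumption): real values depend on the
    architecture, hidden sizes, and hardware.
    """
    encoder_cost = 4 * c_encoder * (K * num_edges + num_nodes * s)
    head_cost = 2 * c_head * num_nodes
    return encoder_cost + head_cost + c_loss

# Hypothetical graph with 1,000 nodes and 5,000 edges, K = 3 and s = 0.7.
cost = perturbgcl_step_cost(num_edges=5000, num_nodes=1000, K=3, s=0.7)
```

In this toy setting the message propagation term K|E| dominates, which matches the observation in the text that the number of MP steps sets the upper bound of the forward cost.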

3. EXPERIMENTS

In this section, we evaluate the performance of PerturbGCL. Specifically, we first visualize the weight distribution and the alignment between positive pairs in Section 3.1 to investigate what the two proposed strategies actually do. Then we evaluate the effectiveness of PerturbGCL in node classification with several benchmarks and SOTA baselines in Section 3.2. Next, we test the performance of PerturbGCL in graph classification under both unsupervised and semi-supervised settings in Section 3.3. After that, we evaluate the contributions of different components in Section 3.4, as well as how PerturbGCL improves the efficiency of GCL against standard methods in Section 3.5. The experimental settings and more experiments are summarized in Appendices A and E. Throughout the experiments, the main observations are highlighted.

Figure 4: Performing more MP steps increases the diversity of two positive views, since $\mathcal{L}_{\mathrm{align}}$ increases. Sweet spots (i.e., minimum performance gap) exist across three scenarios.

[...] regularization can improve the generalization ability of neural networks (Scholkopf & Smola, 2018), we believe this is why the proposed weightPrune can improve the performance.

To investigate the effect of randMP, we report the impact of different $k$ values on PerturbGCL in terms of the alignment between positive views. From Figure 4, we observe that ② randMP can improve the diversity of contrastive views when $k$ increases, and sweet spots widely exist across three datasets. In Figure 4, as $k$ increases, the generalization gap tends to first decrease to the sweet spot and then increase slightly. This validates the effect of randMP in generating correlated but diverged views. It also indicates the potential of randMP to improve the generalization ability, i.e., at these sweet spots.

3.2. CAN PERTURBGCL PERFORM WELL ON NODE CLASSIFICATION TASK?

We first examine the effectiveness of PerturbGCL in node classification. Results of 15 baseline methods across 6 benchmark datasets are collected in Table 1. We make the following observations: ③ PerturbGCL can achieve better node classification results than SOTA GCL methods without using data augmentation. From Table 1, PerturbGCL achieves the best performance in 4 of 6 evaluation scenarios. On average, it ranks 2.33 among 13 augmentation-based baselines, including strong methods such as BGRL and CCA-SSG, which indicates the power of contrastive learning based on model perturbation. Meanwhile, ④ PerturbGCL outperforms the model perturbation baseline, SimGRACE, by large margins. SimGRACE loses to PerturbGCL on all 6 datasets. Specifically, PerturbGCL improves over SimGRACE by 6.11%, 3.53%, 2.34%, 2.26%, 1.95%, and 1.56% on Cora, PubMed, Computer, Photo, CS, and Phy, respectively. This result is in line with our analysis in Section 2.2.

3.3. CAN PERTURBGCL GENERALIZE WELL TO GRAPH CLASSIFICATION TASK?

To validate the effectiveness of PerturbGCL on graph classification, we compare it with state-of-the-art graph-level GCL methods on different datasets. Table 2 and Table 9 (in Appendix) report the results on unsupervised and semi-supervised settings, respectively. We observe that ⑤ PerturbGCL generally performs better than other baselines across two graph learning tasks. In the unsupervised setting (see Table 2), PerturbGCL achieves the best (or comparable to the best) results on 6 of 8 datasets, and obtains substantial improvements on the COLLAB and IMDB-B datasets. In the semi-supervised setting (see Table 9 in Appendix), PerturbGCL generally performs better than other baselines across 7 comparisons and always ranks in the top three on all datasets. These results demonstrate the effectiveness of PerturbGCL on graph learning tasks.

3.4. ABLATION STUDY

We investigate the contributions of different components in PerturbGCL. Figure 5 and Figure 9 in the Appendix report the results on graph and node datasets, respectively. We observe that ⑥ PerturbGCL benefits from the combination of randMP with weightPrune. From the figures, PerturbGCL consistently outperforms two variants (i.e., w/o MP and w/o WP) in all cases, which indicates the reciprocal effects of using randMP and weightPrune together. Moreover, ⑦ when weightPrune is replaced with Gaussian noise, the performance of PerturbGCL drops significantly. In both node and graph scenarios, PerturbGCL outperforms the "noise" variant by a large margin, which verifies the effectiveness of the proposed weightPrune strategy. We also test PerturbGCL under different contrastive objectives, such as Barlow Twins (Bielak et al., 2021), Bootstrap (Thakoor et al., 2022), and InfoNCE (You et al., 2020) (please refer to Appendix C for details). From Figure 5 (middle), we observe that ⑧ PerturbGCL generally performs better with the InfoNCE and Barlow Twins objectives. Given that InfoNCE is the standard contrastive loss and Barlow Twins is negative-sample free, PerturbGCL can be applied both to scenarios with informative negative samples and to scenarios without negative samples by using different losses.

3.5. CAN PERTURBGCL IMPROVE THE TRAINING EFFICIENCY OF GCL?

We compare the proposed PerturbGCL with strong GCL baselines in terms of training costs in Table 3 and report the optimization curves in Figure 5 (right). Experimental configurations are listed in Appendix D. We observe that ⑨ using pure model perturbation, PerturbGCL is training-efficient. From Table 3, we can see that PerturbGCL generally runs significantly faster per epoch than strong baselines, and the gap is particularly evident on graph datasets. Besides, PerturbGCL can converge within one hundred epochs in practice, as shown in Figure 5 (right). Thus, the total training time of PerturbGCL can be further reduced.

3.6. FURTHER ANALYSIS

We finally investigate the sensitivity of PerturbGCL w.r.t. the propagation degree $K$ and the prune ratio $s$ in Figure 10 (left) of the Appendix, the impact of graph augmentation on PerturbGCL in Figure 10 (middle), and the learning capacity of PerturbGCL in Figure 10 (right). We observe that ⑩ PerturbGCL performs stably when $K \in \{1, 2, 3, 4, 5\}$ and $s \in [0.7, 0.9]$. In Figure 10 (left), the performance of PerturbGCL when $s = 0.9$ (or $0.7$) is consistently better than with other values. PerturbGCL is complementary to advanced graph augmentation. From Figure 10 (middle), feeding augmented graphs as input further improves PerturbGCL. However, the trade-off is that the improvement is modest while searching for optimal augmentation strategies is costly. Although PerturbGCL is trained on the original graphs, it generalizes well to perturbed graphs. As shown in Figure 10 (right), PerturbGCL has lower $\mathcal{L}_{\mathrm{align}}$ and $\mathcal{L}_{\mathrm{uniform}}$ values than SimGRACE and a strong GCL baseline, which indicates the effectiveness of the proposed model.
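The alignment and uniformity values used in this analysis (defined in Section 2.2, following Wang & Isola, 2020) can be estimated from a batch of embeddings as in the minimal NumPy sketch below. It assumes that the caller has already L2-normalized the embeddings and that the expectation in the uniformity term is approximated by averaging over all distinct pairs in the batch; the function names are illustrative.

```python
import numpy as np

def align_loss(h_x, h_y, alpha=2):
    """Mean of ||h_x - h_y||_2^alpha over positive pairs (row i of h_x pairs with row i of h_y)."""
    return float(np.mean(np.linalg.norm(h_x - h_y, axis=1) ** alpha))

def uniform_loss(h, t=2):
    """log of the mean of exp(-t * ||h_i - h_j||_2^2) over all distinct pairs i < j."""
    sq_dists = np.sum((h[:, None, :] - h[None, :, :]) ** 2, axis=-1)
    upper = np.triu_indices(h.shape[0], k=1)
    return float(np.log(np.mean(np.exp(-t * sq_dists[upper]))))
```

Lower values are better for both quantities: a small alignment value means positive pairs are close, and a small uniformity value means the embeddings are spread out over the hypersphere.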

4. RELATED WORK

We briefly introduce some related graph contrastive learning methods (Xie et al., 2022) and refer readers to (Zhou et al., 2020) for a comprehensive review of graph neural networks.

Graph contrastive learning with data augmentation. Similar to contrastive learning on images (Chen et al., 2020), data augmentation is crucial to the success of contrastive learning on graphs (GCL). Recently there has been steady progress (You et al., 2020; 2021; Qiu et al., 2020; Lee et al., 2022; Luo et al., 2022) in designing or identifying informative augmentation strategies to boost the performance of GCL. GraphCL (You et al., 2020) introduces four augmentation prototypes for graphs: node dropping, edge perturbation, attribute masking, and subgraph sampling. MoCL (Sun et al., 2021) and G-Mixup (Han et al., 2022) propose to utilize domain knowledge, such as bioisosteres and graphons, to aid augmentation. AutoGCL (Yin et al., 2022), JOAO (You et al., 2021), and GPA (Zhang et al., 2022) suggest leveraging extra AutoML techniques (Waring et al., 2020) to reduce the human labor spent on augmentation choices. Unlike this line of research, we focus on training GCL methods without explicit graph augmentation.

Graph contrastive learning without data augmentation. To eliminate the influence of graph augmentation on GCL, AFGRL (Lee et al., 2022) suggests sampling nodes that share similar semantic information in the hidden space as positive samples. However, it requires non-negligible clustering effort to spot positive pairs during learning. SimGRACE (Xia et al., 2022) proposes to train GCL by disturbing the model weights with Gaussian noise. However, SimGRACE may be limited in handling perturbations beyond the Gaussian distribution, as illustrated in Figure 1. In this work, we introduce a tailored model perturbation framework for GNN encoders without constraining the noise distribution.

5. CONCLUSION

In this work, we explore how to perform contrastive learning on graphs without using data augmentation, and propose a principled framework, PerturbGCL, which is built upon pure model perturbations. Specifically, motivated by the fact that a GNN can be divided into message propagation and feature transformation operations, we develop two tailored perturbation strategies, randMP and weightPrune, to effectively disturb these two crucial operations. We build connections between our model perturbation strategies and well-established graph augmentation techniques to understand the working mechanism of PerturbGCL. Through extensive experiments across multiple datasets and different graph learning tasks, we show that PerturbGCL can achieve competitive results against strong baselines, while requiring substantially shorter computation time for training.

A EXPERIMENT SETUP

In this section, we introduce the datasets used in our experiments. Specifically, we adopt the following 6 popular node-level datasets and summarize their statistics in Table 4.

• Cora and PubMed: two widely used citation network datasets (Sen et al., 2008). Nodes represent documents and edges denote citation links. Each node has a sparse bag-of-words feature vector. Labels are defined as the academic topics.
• Amazon-Computers and Amazon-Photo: two networks of co-purchase relationships constructed from Amazon (McAuley et al., 2015). Nodes indicate goods and edges represent the co-purchase relationships of two products. Each node has a sparse bag-of-words feature encoding product reviews and is labeled with its category. They are widely used for the node classification task.
• Coauthor-CS and Coauthor-Physics: two academic networks, which represent co-authorship graphs based on the Microsoft Academic Graph from the KDD Cup 2016 challenge (Sinha et al., 2015). Nodes represent authors and edges indicate co-authorship relationships. Each node has a sparse bag-of-words feature based on the paper keywords of the author. The task is to predict the most active research field of each author.

Moreover, we also consider 9 graph-level benchmark datasets to verify the effectiveness of PerturbGCL on graph-level learning tasks. Specifically, we use 5 social networks (COLLAB, REDDIT-BINARY, REDDIT-MULTI-5K, IMDB-BINARY, and GITHUB), 2 molecule networks (NCI1 and MUTAG), and 2 bioinformatics networks (PROTEINS and DD) from the TUDataset benchmark (Morris et al., 2020). Table 5 lists their statistics.

Dataset. We use 2 Planetoid graphs (Cora and PubMed) and 4 widely used datasets (Shchur et al., 2018) (Amazon-Computers, Amazon-Photo, Coauthor-CS, and Coauthor-Physics) for experiments. For Cora and PubMed, we follow the common semi-supervised practice (Kipf & Welling, 2016a) to generate train/val/test data splits without any modifications.
For Amazon-Computers, Amazon-Photo, Coauthor-CS, and Coauthor-Physics, no standard data splits are available, so, similar to BGRL (Thakoor et al., 2022) and GBT (Bielak et al., 2022), we generate 20 random train/val/test splits (10%/10%/80%).

Competitors. To have a rigorous and comprehensive comparison with state-of-the-art methods, we compare PerturbGCL with 2 classical unsupervised models, DeepWalk (Perozzi et al., 2014) and GAE (Kipf & Welling, 2016b); 9 standard self-supervised models, DGI (Veličković et al., 2018), GRACE (Zhu et al., 2020), MVGRL (Hassani & Khasahmadi, 2020), GCA (Zhu et al., 2021b), BGRL (Thakoor et al., 2022), GBT (Bielak et al., 2022), InfoGCL (Xu et al., 2021), CCA-SSG (Zhang et al., 2021), and AFGRL (Lee et al., 2021); and one model perturbation method, SimGRACE (Xia et al., 2022). We also compare with supervised learning models including GCN (Kipf & Welling, 2016a) and GAT (Veličković et al., 2017). The results of baselines are quoted from (Zhang et al., 2021; Xu et al., 2021; Xia et al., 2022) if not specified.

Evaluation protocol. We follow the popular linear evaluation scheme to evaluate the performance of unsupervised models. Specifically, we first pre-train the model on the given graph without using ground-truth labels. Then, we freeze the parameters of the encoder and use it to generate node representations. After that, the generated node representations are fed into a linear classifier.

We implement PerturbGCL with PyTorch and use the Adam optimizer to train the model. The graph encoder $f_w$ is specified as a standard two-layer GCN model for all the datasets. We have two hyperparameters to tune: the pruning ratio s and the random propagation step K. For each dataset, we search K ∈ {1, 2, 3} and s ∈ {0.5, 0.7, 0.9}. To reduce the effect of randomness, we report the mean accuracy with standard deviation over 10 random initializations. The detailed hyperparameter settings are summarized in Table 6.
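The random 10%/10%/80% splits described above can be sketched as follows. This is a minimal illustration rather than the authors' code: the function name `random_split`, the use of one seed per split, and the node count of 1000 are all assumptions made for the example.

```python
import numpy as np

def random_split(num_nodes, train_frac=0.1, val_frac=0.1, seed=0):
    """Return disjoint train/val/test index arrays that cover all nodes.

    Hypothetical helper: the 10%/10%/80% proportions follow the text,
    everything else (names, seeding scheme) is assumed for illustration.
    """
    rng = np.random.default_rng(seed)
    perm = rng.permutation(num_nodes)          # random node ordering
    n_train = int(round(train_frac * num_nodes))
    n_val = int(round(val_frac * num_nodes))
    train_idx = perm[:n_train]
    val_idx = perm[n_train:n_train + n_val]
    test_idx = perm[n_train + n_val:]          # remaining ~80% of nodes
    return train_idx, val_idx, test_idx

# 20 independent splits, one per seed (num_nodes=1000 is a toy value).
splits = [random_split(1000, seed=s) for s in range(20)]
```

Under this scheme, the linear classifier would be trained on `train_idx`, tuned on `val_idx`, and evaluated on `test_idx` for each of the 20 splits.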

A.2 SETUP FOR UNSUPERVISED GRAPH CLASSIFICATION

Dataset. For the unsupervised graph classification task, we adopt 8 benchmark datasets (NCI1, PROTEINS, DD, MUTAG, COLLAB, RDT-B, RDT-M5K, and IMDB-B) for experiments, following (You et al., 2020). There are 20 epochs of pre-training under the naive strategy.

Competitors. We compare with kernel-based methods such as the graphlet kernel (GL) (Shervashidze et al., 2009), the Weisfeiler-Lehman sub-tree kernel (WL) (Shervashidze et al., 2011), and the deep graph kernel (DGK) (Yanardag & Vishwanathan, 2015); other unsupervised graph representation models such as node2vec (Grover & Leskovec, 2016), sub2vec (Adhikari et al., 2018), and graph2vec (Narayanan et al., 2017); and state-of-the-art GCL methods such as MVGRL (Hassani & Khasahmadi, 2020), InfoGraph (Sun et al., 2019), GraphCL (You et al., 2020), JOAO (You et al., 2021), and SimGRACE (Xia et al., 2022).

Evaluation protocol. Following GraphCL (You et al., 2020), we contrastively train the representation model using unlabeled graph data, and then fix the representation model and train a downstream classifier using labeled data. Specifically, we adopt an SVM as the classifier and perform 10-fold cross validation. For each fold, we use 90% of the data as labeled training data and the remaining 10% as labeled testing data. To reduce the effect of randomness, we repeat the experiments 5 times and report the averaged results. Following GraphCL (You et al., 2020), we use GIN (Xu et al., 2018) as the GNN backbone, and search the best K and s from {1, 2, 3} and {0.5, 0.7, 0.9}, respectively. Table 7 reports the parameter configurations for all datasets.

Dataset. We perform the semi-supervised graph classification task on 7 popular benchmark datasets (NCI1, PROTEINS, DD, COLLAB, RDT-B, RDT-M5K, and GITHUB) from TUDataset (Morris et al., 2020). There are 100 epochs of pre-training under the default setting.

Competitors.
We compare with unsupervised graph representation models, namely GAE (Kipf & Welling, 2016b), Infomax (DGI) (Veličković et al., 2018), and ContextPred (Hu et al., 2019), and other state-of-the-art GCL methods such as InfoGraph (Sun et al., 2019), GraphCL (You et al., 2020), JOAO (You et al., 2021), and SimGRACE (Xia et al., 2022).

Evaluation protocol. We employ 10-fold cross validation on each dataset. For each fold, we use 80% of the data as unlabeled data, 10% as labeled training data, and 10% as labeled testing data. For the augmentation-only (Augmentations) experiments, we only perform 30 epochs of supervised training with augmentations using labeled data. Following GraphCL (You et al., 2020), we use ResGCN (Chen et al., 2019) as the GNN backbone, and search the best K and s from {1, 2, 3} and {0.5, 0.7, 0.9}, respectively. Table 8 reports the parameter configurations for all datasets.

Algorithm 1 PerturbGCL on node-level tasks
6: Compute the target representation $h_v$ according to Eq. 1 by performing $k'$ times of $g(\cdot)$
7: Get $z_v = h(h_v)$ according to Eq. 2
8: Compute the perturbed representation $h_v^+$ according to Eq. 1 by performing $k''$ times of $g(\cdot)$ and using the masked weight $W \odot m_w$
9: Get $z_v^+ = h(h_v^+)$
10: Define $L_{CL} = \frac{1}{|V|} \sum_{v \in V} -\log \frac{\exp(\mathrm{sim}(z_v, z_v^+)/\tau)}{\exp(\mathrm{sim}(z_v, z_v^+)/\tau) + \sum_{u \in V, u \neq v} \exp(\mathrm{sim}(z_v, z_u^-)/\tau)}$ according to Eq. 3
11: Optimize $f_w(\cdot)$, $h(\cdot)$ to minimize $L_{CL}$
12: end for
13: return the pre-trained GNN encoder $f_w(\cdot)$
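The contrastive objective in step 10 of Algorithm 1 can be sketched in NumPy as below. This is an illustrative sketch, not the authors' implementation. It assumes that sim(·,·) is cosine similarity and that the negatives $z_u^-$ are the perturbed-view embeddings of the other nodes; neither choice is confirmed by this section. The function name `info_nce` and the default temperature are hypothetical.

```python
import numpy as np

def info_nce(z, z_pos, tau=0.5):
    """InfoNCE loss averaged over nodes.

    z, z_pos: [N, D] embeddings of the target and perturbed models.
    Assumptions (not from the source): cosine similarity, and negatives
    taken as the perturbed-view embeddings of all other nodes.
    """
    # Row-normalize so that dot products are cosine similarities.
    z = z / np.linalg.norm(z, axis=1, keepdims=True)
    z_pos = z_pos / np.linalg.norm(z_pos, axis=1, keepdims=True)
    logits = z @ z_pos.T / tau                  # [N, N]; diagonal = positives
    # Numerically stable row-wise log-softmax.
    logits = logits - logits.max(axis=1, keepdims=True)
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # -log of (positive / (positive + negatives)), averaged over nodes.
    return float(-np.mean(np.diag(log_prob)))
```

Each row of the softmax contains one positive term (the diagonal) and N-1 negative terms, which matches the denominator in step 10 under the stated assumptions.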

C MORE CONTRASTIVE LOSS FUNCTIONS

Although InfoNCE (You et al., 2020) (illustrated in Eq. 3) is the most widely used contrastive objective for learning GCL models, other training objectives have been proposed recently, such as Barlow Twins (Bielak et al., 2021) and Bootstrap (Thakoor et al., 2022). Different from InfoNCE, these two objectives are negative-sample free, so they avoid the effort of identifying informative negative samples during training.

Specifically, the core idea of the Bootstrap objective is to maximize the similarity between positive pairs, defined as:

$$L = -\frac{2}{N} \sum_{v \in V} \frac{z_v^\top h_v^+}{\|z_v\| \, \|h_v^+\|}.$$

Here $h_v^+$ is the hidden representation encoded by the GNN encoder, and $z_v = h(h_v^+)$, where $h(\cdot)$ is the prediction head. In our case, since we take the original graph as input, we do not use the symmetric form of the loss function as done in (Thakoor et al., 2022). This might be the reason why PerturbGCL does not perform well with this objective.

Barlow Twins (Bielak et al., 2021) is a recent endeavor to reduce the usage of negative samples. This objective was originally proposed in the image domain by (Zbontar et al., 2021). The general idea of Barlow Twins is to minimize the redundancy in the hidden dimensions. Specifically, given the hidden representations of two views ($Z$ and $Z^+$), it first computes the empirical cross-correlation matrix $C \in \mathbb{R}^{D \times D}$ as below:

$$C_{i,j} = \frac{\sum_n Z_{n,i} Z^+_{n,j}}{\sqrt{\sum_n (Z_{n,i})^2} \sqrt{\sum_n (Z^+_{n,j})^2}},$$

where $n$ indexes the samples in the batch and $i, j$ index the embedding dimensions. The cross-correlation matrix $C$ is optimized to be equal to the identity matrix. The loss is composed of two parts: 1) the invariance term and 2) the redundancy reduction term. The first forces the on-diagonal elements $C_{i,i}$ to be equal to one, hence making the embeddings invariant to the applied augmentations. The second drives the off-diagonal elements $C_{i,j}$ to zero, which results in decorrelated components of the embedding vectors.
Formally, the loss function $L_{BT}$ is computed by:

$$L_{BT} = \sum_i (1 - C_{i,i})^2 + \lambda \sum_i \sum_{j \neq i} C_{i,j}^2.$$

The parameter $\lambda > 0$ defines the trade-off between the invariance and redundancy reduction terms. In our experiments, we set $\lambda = \frac{1}{D}$ following (Bielak et al., 2021).
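The two negative-sample-free objectives in this appendix can be written down directly from their formulas. The NumPy sketch below is only an illustration under stated assumptions: the function names are hypothetical, the inputs are plain arrays rather than encoder outputs, and no mean-centering is applied before the cross-correlation because the formula in the text does not include one.

```python
import numpy as np

def bootstrap_loss(z, h_pos):
    """L = -(2/N) * sum_v cos(z_v, h_pos_v), for [N, D] inputs."""
    cos = np.sum(z * h_pos, axis=1) / (
        np.linalg.norm(z, axis=1) * np.linalg.norm(h_pos, axis=1))
    return float(-2.0 * cos.mean())

def barlow_twins_loss(z, z_pos, lam=None):
    """Invariance term plus lam times the redundancy reduction term."""
    d = z.shape[1]
    if lam is None:
        lam = 1.0 / d                       # lambda = 1/D, as in the text
    # C[i, j] = sum_n Z[n, i] Z+[n, j] / (||Z[:, i]|| * ||Z+[:, j]||)
    num = z.T @ z_pos
    den = (np.sqrt((z ** 2).sum(axis=0))[:, None]
           * np.sqrt((z_pos ** 2).sum(axis=0))[None, :])
    c = num / den
    on_diag = ((1.0 - np.diag(c)) ** 2).sum()          # push C_ii toward 1
    off_diag = (c ** 2).sum() - (np.diag(c) ** 2).sum()  # push C_ij toward 0
    return float(on_diag + lam * off_diag)
```

For two identical views with perfectly decorrelated dimensions, the cross-correlation matrix is the identity and the Barlow Twins loss is zero, while the Bootstrap loss reaches its minimum of -2.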

D EFFICIENCY ANALYSIS

To evaluate the efficiency of the proposed PerturbGCL, we compare it with two strong GCL baselines: BGRL (Thakoor et al., 2022) on node classification data and GraphCL (You et al., 2020) on graph classification data. It is worth noting that BGRL and GraphCL need to search for the optimal augmentation strategies via trial-and-error, which is very expensive in practice. To simplify the comparison, we fix their optimal augmentations and only record the running time under these augmentations. For graph-level datasets, GraphCL and PerturbGCL use 3 GIN layers with hidden dimension 32 as the backbone encoder. The propagation step and pruning ratio of PerturbGCL are set to K = 2 and s = 0.7. For node-level datasets, BGRL and PerturbGCL use 2 GCN layers with hidden dimension 512 as the backbone encoder. The propagation step and pruning ratio of PerturbGCL are set to K = 3 and s = 0.9. We conduct experiments on a server with AMD EPYC 7282 16-Core processors, 252 GB memory, and one GeForce RTX 2080 Ti GPU (24GB). To reduce the effect of randomness, the results reported in Table 3 are averaged over 100 training epochs. Beyond the improvement shown in Table 3, the practical speedup of PerturbGCL should be significantly higher, mainly because data augmentation based GCL methods require considerable effort to search for the best augmentation strategies, such as the best augmentation types and their corresponding perturbation ratios.



Figure 1: The alignment and uniformity performance (↓) of SimGRACE and augmentation-based GCL method (Thakoor et al., 2022) on the same perturbed graphs. Black circles indicate the results of baselines. Orange circles represent the results of SimGRACE. SimGRACE performs worse than standard GCL methods in modeling common perturbations achieved by data augmentation. Detailed experimental setups are listed in Appendix B.

However, we found that the contrastive views created by SimGRACE could be sub-optimal compared with their data-augmentation counterparts (see Figure 1). To measure the quality of representations learned by GCL models, we consider two popular metrics: alignment and uniformity (Wang & Isola, 2020).

Figure 3: Visualization of the weight distribution (from left to right: initial weights, PerturbGCL w/o weightPrune, and PerturbGCL) on Coauthor-Physics. The x-axis indicates weight values and the y-axis is the count. The number of activated neurons after using weightPrune is significantly smaller than in the other two cases, which suggests that weightPrune can regularize the model.

Figure 4: Visualization of PerturbGCL w.r.t. different k values on the original graphs and the perturbed graphs generated by data augmentation. The x-axis indicates propagation steps and the y-axis is $L_{align}$ (↓). The gap between the blue and orange lines indicates the generalization ability. Performing more MP steps increases the diversity of the two positive views, since $L_{align}$ increases. Sweet spots (i.e., minimum performance gap) exist across the three scenarios.

Table 1: Test accuracy on benchmark datasets in terms of node classification. We report both mean accuracy and standard deviation. A.R. denotes the averaged rank.

Figure 5: Left: Ablation study of PerturbGCL. Middle: The impact of different contrastive objectives. Right: Empirical training curves of PerturbGCL with different s values.

1: Input: Original graph $G = (V, E, X)$, GNN encoder $f_w(\cdot)$ with weight $W$, projection head $h(\cdot)$, the maximum propagation step $K$, and pruning ratio $s$
2: Initialize the encoder $f_w(\cdot)$ and set the mask indicator to ones.
3: for iterate 1, 2, ... times until convergence do
4: Sample the random propagation steps $k'$, $k''$ from the uniform distribution $U(1, K)$
5: Conduct weight pruning to update the mask indicator $m_w$
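Steps 4 and 5 of the algorithm (randMP and weightPrune) can be illustrated with the short NumPy sketch below. It is a simplified sketch under several assumptions that are not confirmed by this section: s is taken to be the fraction of weights that are pruned, $U(1, K)$ is treated as a discrete uniform distribution over {1, ..., K}, the mask is computed over a single weight matrix, and the encoder is reduced to one linear transformation followed by k propagation steps. All function names are hypothetical.

```python
import numpy as np

def magnitude_mask(weight, s):
    """Binary mask zeroing the fraction s of entries with smallest |w|.

    Assumption: s is the pruned fraction, applied to one weight matrix.
    """
    k = int(round(s * weight.size))            # number of entries to prune
    mask = np.ones(weight.size, dtype=weight.dtype)
    if k > 0:
        prune_idx = np.argsort(np.abs(weight).ravel())[:k]
        mask[prune_idx] = 0
    return mask.reshape(weight.shape)

def sample_steps(K, rng):
    """Draw k', k'' independently and uniformly from {1, ..., K}."""
    return int(rng.integers(1, K + 1)), int(rng.integers(1, K + 1))

def encode(a_norm, x, weight, k):
    """Toy encoder: one transformation, then k propagation steps."""
    h = x @ weight
    for _ in range(k):
        h = a_norm @ h
    return h

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 4))                    # toy node features
w = rng.normal(size=(4, 3))                    # toy transformation weight
a = np.eye(5)                                  # placeholder for a normalized adjacency
m = magnitude_mask(w, s=0.5)
k1, k2 = sample_steps(K=3, rng=rng)
h = encode(a, x, w, k1)                        # target view
h_pos = encode(a, x, w * m, k2)                # perturbed view with W ⊙ m_w
```

The two views differ both in the number of propagation steps and in the weights used, which is the pairing that the contrastive loss then compares.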

Figure 6: The alignment and uniformity plot for BGRL (Thakoor et al., 2022), SimGRACE (Xia et al., 2022), GRACE (Zhu et al., 2020), CCA-SSG (Zhang et al., 2021), and our PerturbGCL on the same perturbed graphs generated by data augmentation. Black circles indicate the baselines. Orange circles represent the performance of SimGRACE. Red stars (⋆) are the results of PerturbGCL.

Figure 7: Empirical training curves of PerturbGCL on graph benchmarks with different pruning ratios s.

Figure 8: Visualization of the second-layer weight distribution during the training process (from left to right: initial weights, PerturbGCL w/o weightPrune, and PerturbGCL) on Coauthor-CS. The x-axis indicates weight values and the y-axis is the corresponding count.

Figure 10: Left: Hyperparameter Analysis on Coauthor-CS. Middle: PerturbGCL with data augmentation. Right: The alignment and uniformity results of PerturbGCL.

Test accuracy on benchmark datasets in terms of node classification. We report both mean accuracy and standard deviation. A.R. denotes the averaged rank.

± 0.2   89.88 ± 0.33   93.22 ± 0.28   93.27 ± 0.17   95.69 ± 0.10   3.83
SimGRACE     78.5 ± 0.3   79.3 ± 0.5     86.42 ± 0.35   91.55 ± 0.22   92.37 ± 0.33   94.37 ± 0.15   10.67
PerturbGCL   83.3 ± 0.5   82.10 ± 0.37   88.45 ± 0.77   93.62 ± 0.40   94.18 ± 0.09   95.85 ± 0.08   2.33

Table 2 and Table 9 (in Appendix) report test accuracy on TUDataset benchmarks in the unsupervised graph classification setting. Missing entries mean that results are not available in published papers.

Table 3: Running time per epoch (in seconds). Baseline indicates BGRL and GraphCL for node and graph classification, respectively. All the methods are evaluated on a GeForce RTX 2080 Ti GPU.

Table 4: Dataset statistics of node-level benchmarks.

Table 5: Dataset statistics of graph-level benchmarks.

Table 6: Hyperparameters of PerturbGCL on node classification. We use GCN (Kipf & Welling, 2016a) as the backbone encoder.

The linear classifier used for evaluation is a simple logistic regression model that makes the prediction for each node. It is worth noting that only nodes in the training set are used as supervision when training the classifier, and we report the accuracy results on testing nodes.

Table 7: Hyperparameters of PerturbGCL on unsupervised graph classification. We use GIN (Xu et al., 2018) as the backbone encoder.

Table 8: Hyperparameters of PerturbGCL on semi-supervised graph classification. We use ResGCN (Chen et al., 2019) as the backbone encoder.


6: Conduct weight pruning to update the mask indicator $m_w$
7: Compute the target representation $h_v$ according to Eq. 1 by performing $k'$ times of $g(\cdot)$
8: Get $z_G = \mathrm{READOUT}(\{h(h_v)\}_{v \in G_n})$ according to Eq. 2
9: Compute the perturbed representation $h_v^+$ according to Eq. 1 by performing $k''$ times of $g(\cdot)$ and using the masked weight $W \odot m_w$
10: ...
end for
14: end for
15: return the pre-trained GNN encoder $f(\cdot)$

B DETAILS FOR TOY EXAMPLE

To verify the limitation of SimGRACE (Xia et al., 2022) in handling perturbations created by data augmentation, we select three popular data augmentation based baselines: GRACE (Zhu et al., 2020), BGRL (Thakoor et al., 2022), and CCA-SSG (Zhang et al., 2021). To measure the quality of the representation models in learning representations for input data, we adopt the widely used alignment and uniformity metrics (Wang & Isola, 2020) for quantitative analysis. According to (Wang & Isola, 2020), smaller is better for both metrics.

Evaluation setting. For all methods, we first pre-train them according to their own configurations on the PubMed, Amazon-Photo, and Coauthor-CS datasets. Then, we use data augmentation strategies to construct two perturbed views. Specifically, we follow the detailed empirical study of (Zhu et al., 2021a) and adopt edge perturbation and attribute masking as the default perturbation functions on the input graph. For a fair comparison, we fix the random seed and generate two shared perturbed graphs, and then feed the two views into BGRL, GRACE, CCA-SSG, and SimGRACE to obtain node representations for all nodes in the graph. After that, we use the obtained node representations of the two views to compute the alignment and uniformity according to Eq. 5. We repeat the process 10 times and report the averaged results in Figure 1.
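The two metrics used in this toy example can be sketched as follows. Eq. 5 is not reproduced in this appendix, so the sketch falls back on the standard definitions from Wang & Isola (2020): alignment as the mean distance between L2-normalized positive pairs raised to a power α, and uniformity as the log of the mean Gaussian potential over distinct pairs. The defaults α = 2 and t = 2 and the function names are assumptions, and the paper's exact formulation may differ.

```python
import numpy as np

def l2_normalize(z):
    """Project each row onto the unit hypersphere."""
    return z / np.linalg.norm(z, axis=1, keepdims=True)

def alignment(z1, z2, alpha=2):
    """Mean of ||f(x) - f(y)||^alpha over positive pairs (row i of z1, z2)."""
    z1, z2 = l2_normalize(z1), l2_normalize(z2)
    return float(np.mean(np.linalg.norm(z1 - z2, axis=1) ** alpha))

def uniformity(z, t=2):
    """log of the mean of exp(-t * ||f(x) - f(y)||^2) over distinct pairs."""
    z = l2_normalize(z)
    # Pairwise squared distances; O(N^2 D) memory, fine for small N only.
    sq_dist = np.sum((z[:, None, :] - z[None, :, :]) ** 2, axis=-1)
    iu = np.triu_indices(z.shape[0], k=1)      # each unordered pair once
    return float(np.log(np.mean(np.exp(-t * sq_dist[iu]))))
```

Here `z1` and `z2` would be the node representations of the two shared perturbed views. Both values are lower-is-better, consistent with the description above.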

