G-CENSOR: GRAPH CONTRASTIVE LEARNING WITH TASK-ORIENTED COUNTERFACTUAL VIEWS

Abstract

Graph contrastive learning (GCL) has achieved great success in learning representations from unlabeled graph-structured data. However, how to automatically obtain the optimal contrastive views w.r.t. specific downstream tasks has received little study. Theoretically, a downstream task can be causally correlated to particular substructures in graphs. Existing GCL methods may fail to enhance model performance on a given task when the task-related semantics are incomplete in the positive views or preserved in the negative views. To address this problem, we propose G-CENSOR, i.e., Graph Contrastive lEarniNg with taSk-oriented cOunteRfactual views, a model-agnostic framework designed for node property prediction tasks. G-CENSOR can simultaneously generate the optimal task-oriented counterfactual positive/negative views for raw ego-graphs and train graph neural networks (GNNs) with a contrastive objective between the raw ego-graphs and their corresponding counterfactual views. Extensive experiments on eight real-world datasets demonstrate that G-CENSOR consistently outperforms existing state-of-the-art GCL methods, improving the task performance and generalizability of a series of typical GNNs. To the best of our knowledge, this is a pioneering investigation of task-oriented graph contrastive learning from a counterfactual perspective for node property prediction tasks. We will release the source code after the review process.

1. INTRODUCTION

Inspired by the convincing success of contrastive learning in computer vision (Chen et al., 2020; He et al., 2020) and natural language processing (Gao et al., 2021), graph contrastive learning (GCL) has become an emerging field that extends the idea to graph data (You et al., 2020a; Hassani & Ahmadi, 2020; Zhu et al., 2021; Li et al., 2022), leading to generalizable, transferable and robust representations from unlabeled graph data (You et al., 2021). Nevertheless, the generation mechanism of contrastive views, which has been recognized as an essential component of GCL (Zhu et al., 2021; Yin et al., 2022; You et al., 2021), still faces the following challenges: (a) Independence of downstream tasks. Although GCL was originally proposed for self-supervised learning, how to obtain the optimal positive view when downstream tasks are available is an important question (Xie et al., 2022). However, most prior works, whether based on graph diffusion (Hassani & Ahmadi, 2020), uniform sampling (Zhu et al., 2020), or adaptive sampling (Zhu et al., 2021; You et al., 2021), ignore the downstream tasks' information. As shown in Figure 1, whether a generated view is an appropriate positive view depends critically on the downstream task (Chen et al., 2020). (b) Fitting spurious correlations. To introduce task information, learnable data augmentation has been investigated to automatically obtain positive views for downstream tasks (Yin et al., 2022). While these techniques have achieved promising performance, they are prone to fitting spurious correlations between graph structures and downstream tasks, like general supervised methods, thus hurting the generalizability of the representation model. (c) Difficulty in negative view selection. Besides positive views, negative sampling is also a vital component of GCL. Contrastive learning can benefit from hard negative samples (Joshua et al., 2021).
Meanwhile, negative samples that are actually similar to the raw instances can lead to a performance drop (Chuang et al., 2020). Therefore, it can be hard to select suitable negative samples. Some works (He et al., 2020) utilize a great number of negative samples to avoid this trade-off but may cause scalability problems. These challenges become even more non-trivial with graph data, since graphs are far more complex due to their non-Euclidean nature (Zhu et al., 2021).

Figure 1: An illustration of task-oriented contrastive views. Task A is to predict whether a node is in a triangle and task B is to predict the color of a node. A task-oriented view is positive if and only if it contains the credible evidence for the task label; otherwise it should be negative.

In this paper, we propose a novel model-agnostic framework for node property prediction tasks, namely G-CENSOR, i.e., Graph Contrastive lEarniNg with taSk-oriented cOunteRfactual views. G-CENSOR generates high-quality positive and negative views simultaneously from a counterfactual perspective. In other words, a task-oriented counterfactual question about the contrastive views can be asked: "Would a judgment on the task label of an ego-graph change if part of the structure of the ego-graph were erased?" The answer no should be assigned to a positive view, while the answer for a negative view should be yes. Technically, G-CENSOR adopts a learnable view generation approach and leverages an original counterfactual optimization objective to decompose an ego-graph into the substructures causally correlated to the downstream task and the substructures spuriously correlated to it. These two parts can further be regarded as positive and negative views, respectively. Training a representation model with such contrastive views can enhance both its task performance and its generalizability.
Notably, G-CENSOR does not need to contrast in-batch negative samples, which helps it avoid the performance-memory trade-off inherent in most prior GCL methods. The core contributions of this paper are three-fold: (a) To the best of our knowledge, this is a pioneering investigation of task-oriented graph contrastive learning from a counterfactual perspective for node property prediction tasks. (b) We develop a novel model-agnostic framework, G-CENSOR, which automatically generates both task-oriented counterfactual positive and negative views to enable sufficient and efficient graph contrastive learning. (c) We conduct extensive experiments on eight real-world datasets to demonstrate the superiority of G-CENSOR over existing state-of-the-art GCL methods in improving the performance and generalizability of typical GNNs.

2. RELATED WORK

2.1. GRAPH CONTRASTIVE LEARNING

Inspired by the success of data augmentation and contrastive learning on texts and images for addressing data noise and data scarcity, many graph contrastive learning (GCL) frameworks have been proposed lately (Liu et al., 2022; Xie et al., 2022). GCL works usually consist of a graph view generation component that constructs positive and negative views, and a contrastive objective that discriminates positive pairs from negative pairs (Xie et al., 2022). Most works generate positive views by uniform random transformations, e.g., node dropping, edge perturbation and subgraph sampling (Zhu et al., 2020; You et al., 2020b; Yu et al., 2020; Zhao et al., 2021; You et al., 2020a; Hassani & Ahmadi, 2020; Sun et al., 2020; Velickovic et al., 2019). Zhu et al. (2021) proposed adaptive data augmentation strategies that reflect the input graph's intrinsic patterns, i.e., assigning larger drop probabilities to unimportant edges. Recently, several works have proposed trainable augmentation strategies (You et al., 2021; Li et al., 2022; Yin et al., 2022) that learn drop probability distributions over nodes or edges. However, few works have discussed how to generate optimal task-oriented positive and negative views for graph data, and no learnable augmentation strategy has been proposed for node property prediction tasks. Table 1 compares G-CENSOR with the other state-of-the-art GCL models on 4 different properties.

2.2. COUNTERFACTUAL GRAPH EXPLANATION

Recently, research on the explainability of GNNs has been developing rapidly (Yuan et al., 2021). Most GNN explanation methods (Ying et al., 2019; Luo et al., 2020) focus on identifying a subgraph of the original graph that contributes most to the prediction of a trained GNN. There are also works that make explanations by retrieving similar existing instances (Faber et al., 2020). Since removing the explanation subgraph from the input graph does not necessarily change the prediction (Bajaj et al., 2021), counterfactual GNN explanation techniques (Lucic et al., 2022; Bajaj et al., 2021) have been proposed. These techniques identify a small subset of edges of the input graph instance such that removing those edges significantly results in an alternative prediction (Lucic et al., 2019; 2022; Bajaj et al., 2021). While all the above works deal with post hoc explanations of estimated GNNs, this study emphasizes ad hoc detection of the subgraph causally correlated with the task. See Section 4.2 for details.

3. PRELIMINARIES AND PROBLEM FORMULATION

In this section, we provide formal definitions of the optimal task-oriented counterfactual positive/negative views with insightful explanations, and formulate the research problems. Let G = (V, E) be an undirected graph with nodes v_i ∈ V and edges (v_i, v_j) ∈ E. Each node v_i has a feature vector x_{v_i} = [x_1, x_2, ..., x_d]. The k-hop ego-graph (Haahr, 2007) of a node v_i can be defined as G_{v_i,k} = (V_{v_i,k}, E_{v_i,k}, v_i), where v_i indicates the ego, V_{v_i,k} ⊆ V is the set of all the nodes that are at most k hops away from v_i, and E_{v_i,k} ⊆ E is the set of interconnecting edges between nodes in V_{v_i,k}. G′_{v_i,k} = (V′_{v_i,k}, E′_{v_i,k}, v_i) denotes an ego-subgraph of G_{v_i,k}, where E′_{v_i,k} ⊆ E_{v_i,k} and V′_{v_i,k} is the set of nodes involved in E′_{v_i,k}. The complement of G′_{v_i,k} w.r.t. G_{v_i,k} is defined as Ḡ′_{v_i,k} = (V̄′_{v_i,k}, Ē′_{v_i,k}, v_i), where Ē′_{v_i,k} = E_{v_i,k} \ E′_{v_i,k} and V̄′_{v_i,k} is the set of nodes involved in Ē′_{v_i,k}. Note that G+_{v_i,k} and G-_{v_i,k} must be connected subgraphs, so isolated edges in them are removed. For brevity, we omit v_i and k in the subsequent sections.

Definition 1 (Optimal Task-oriented Counterfactual Positive View G+). An ego-subgraph G′ is the optimal task-oriented counterfactual positive view G+ for an ego-graph instance G if and only if E′ contains, and only contains, all the edges that are causally correlated to the task label.

Definition 2 (Optimal Task-oriented Counterfactual Negative View G-). An ego-subgraph G′ is the optimal task-oriented counterfactual negative view G- for an ego-graph instance G if and only if G′ is the complement of the optimal task-oriented counterfactual positive view G+ w.r.t. G.

Figure 2 explains G+ and G- from the perspective of the ego-graph generation process for node property prediction tasks. In both cases, the relationship between E_1 and y is stable since E_1 is the direct cause or effect of y, i.e., causally correlated to y.
The joint distribution of E_0 and y would differ if the ego set changed, since E_0 and y are spuriously correlated (Schölkopf et al., 2012; Joshi & He, 2022) conditioned on the ego, i.e., v. Therefore, according to Definition 1, G+ should be G_1 = (V_1, E_1, v); according to Definition 2, G- should be G_0 = (V_0, E_0, v). Now let ϕ(G): {G_{v_i} | v_i ∈ V} → R^c be any GNN that maps an ego-graph G to a probability distribution p of the ego over the label space R^c. Contrastive learning with task-oriented counterfactual views is defined as:

Definition 3 (Graph Contrastive Learning with Task-oriented Counterfactual Views). Given a G with its G+ and G-, learn a function ϕ that maximizes the consistency between the pair (ϕ(G), ϕ(G+)) compared with the pair (ϕ(G), ϕ(G-)).

Figure 2: (a) The ego (v) generates E_0 (with nodes V_0) and E_1 (with nodes V_1); only the structure of E_1 determines the label (y). (b) The label (y) is an inherited property of the ego (v) and this property causes E_1 (with nodes V_1), while other properties of v cause E_0. For interpretative examples, refer to Appendix A.1; for the causal diagrams of the k-hop (k > 1) ego-graph generation process, refer to Appendix A.2.

As Figure 2 shows, GNNs trained in the normal flow may fit the spurious correlations between G- and y since they take the whole observed graph G as input. Learning such non-causal associations reduces the reliability and generalizability of the GNNs. In this study, we train GNNs with auxiliary graph contrastive learning with task-oriented counterfactual views. Our goal is to make the learned representation as independent as possible from G-, so that the model achieves better generalizability. Assume that ϕ is a sufficient encoder (Chen et al., 2020), i.e., an input graph G can be encoded without information loss. Let R(ϕ) represent the empirical risk of estimation and L(ϕ(G), y) represent the estimation loss; we then make the following assumption:

Assumption 1.
Given a set {(G, y)}, a sufficient ϕ estimated with {(G-, y)} suffers a greater empirical risk than a sufficient ϕ estimated with {(G+, y)}, i.e., R(ϕ̂-) = E[L(ϕ̂-(G-), y)] > R(ϕ̂+) = E[L(ϕ̂+(G+), y)], where ϕ̂ = argmin_ϕ R(ϕ) and L(ϕ(G), y) can be the 0-1 loss 1[argmax_i [ϕ(G)]_i ≠ y].

In Appendix A.1, we provide an example to explain the reasonableness of Assumption 1. Let φ(G; Λ) be a function parameterized by Λ, which takes G as input and produces the optimal task-oriented positive/negative views. We can enforce φ to satisfy the signature in Assumption 1. Based on the above definitions and assumption, this paper investigates the following two research questions:

Question 1. How can we design and learn a model-agnostic φ under Assumption 1 to conduct graph contrastive learning with task-oriented counterfactual views?

Question 2. Can we really enhance ϕ on given tasks with graph contrastive learning with task-oriented counterfactual views produced by φ?
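To make the ego-graph notation concrete, here is a minimal Python sketch (not part of the paper's implementation) that extracts a k-hop ego-graph G_{v,k} from an undirected edge list and splits its edge set into a candidate positive view E′ and the complement negative view E \ E′. Which edges are causally correlated is assumed to be given here, whereas G-CENSOR learns it; the function name and interface are illustrative only.

```python
from collections import deque

def ego_graph_views(edges, ego, k, causal_edges):
    """Build the k-hop ego-graph of `ego` and split its edges into a
    candidate positive view (edges marked causal) and the complement
    negative view. `causal_edges` holds frozenset({u, w}) pairs."""
    adj = {}
    for u, w in edges:
        adj.setdefault(u, set()).add(w)
        adj.setdefault(w, set()).add(u)
    # BFS up to depth k collects V_{v,k}
    dist = {ego: 0}
    q = deque([ego])
    while q:
        u = q.popleft()
        if dist[u] == k:
            continue
        for w in adj.get(u, ()):
            if w not in dist:
                dist[w] = dist[u] + 1
                q.append(w)
    nodes = set(dist)
    # E_{v,k}: edges with both endpoints inside the k-hop neighborhood
    ego_edges = {frozenset((u, w)) for u, w in edges if u in nodes and w in nodes}
    pos = {e for e in ego_edges if e in causal_edges}
    neg = ego_edges - pos
    return nodes, pos, neg
```

The paper additionally requires G+ and G- to be connected (isolated edges removed); that pruning step is omitted in this sketch.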

4. METHODOLOGY

To answer Question 1, this section illustrates the proposed framework G-CENSOR. The overall architecture of G-CENSOR is shown in Figure 3.

4.1. BASE GNN MODELS

We select GraphSAGE (Hamilton et al., 2017) (abbreviated as SAGE), GAT (Veličković et al., 2018) and GIN (Xu et al., 2019) as the base GNN models, i.e., ϕ, to validate G-CENSOR. These three models were chosen because they are representative and have been applied in many real scenarios. For details of these three models, refer to Appendix B. We can then minimize a general prediction loss L_pred, e.g., the negative log-likelihood (NLL), to get the estimated model ϕ̂ as follows:

L_pred(ϕ) = E_{G_v | ∀v∈V}[NLL(ϕ(G_v; Θ), y_v)]. (1)
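As a minimal illustration (assuming the GNN's outputs are already normalized probability distributions, which is an assumption of this sketch rather than a statement about the paper's implementation), L_pred reduces to a mean negative log-likelihood over ego-graphs:

```python
import numpy as np

def l_pred(probs, labels):
    """Mean NLL over a batch of ego-graphs: `probs` is an (n, c) array of
    predicted class distributions phi(G_v; Theta), `labels` the true y_v."""
    n = len(labels)
    return float(-np.mean(np.log(probs[np.arange(n), labels] + 1e-12)))
```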

4.2. COUNTERFACTUAL TRANSFORMATION MODULE (CTM)

Figure 4 depicts the proposed counterfactual transformation module for ego-graphs, i.e., φ, which estimates the probability that an edge is part of the optimal task-oriented counterfactual positive view G+ for a G.

Representation of an edge. To make an edge as distinguishable as possible, we preserve both the feature and the structural information in the representation of an edge. Technically, for a node u in an ego-graph G_v, we first assign an ego identifier defined as 1_{u=v}, and then construct auxiliary structural information leveraging Distance Encoding (DE) (Li et al., 2020), defined as the landing probabilities of random walks of different lengths from the ego v to the node u, denoted by r_u. A new node feature vector for u is thus given as

x̃_{u|G_v} = x_u ∥ r_{u|G_v} ∥ 1_{u=v}, where r_{u|G_v} = [(AD⁻¹)_{u,v}, (AD⁻¹)²_{u,v}, ..., (AD⁻¹)ˡ_{u,v}]. (2)

A is the adjacency matrix of G_v such that A_{u,ς} = 1 iff (u, ς) ∈ E_v, and D is the (diagonal) degree matrix of G_v where D_{u,u} is the degree of node u. l indicates the length of the random walks, which is set equal to k. An edge (u, ς) in G_v is then represented as

e_{uς|G_v} = x̃_{v|G_v} ∥ x̃_{u|G_v} ∥ x̃_{ς|G_v}. (3)

Probability that an edge belongs to G+_v. For simplicity, we assume the binary variable 1_{(u,ς)∈E+_v} follows a Bernoulli distribution, i.e., 1_{(u,ς)∈E+_v} ~ Ber(θ_{uς|G_v}), where θ_{uς|G_v} is estimated by multiple multilayer perceptrons (MLPs) as follows:

P((u, ς) ∈ E+_v) = θ_{uς|G_v} = S( (1/m) Σ_{i=1}^m MLP_i(e_{uς|G_v}; Λ_i) ), (4)

where S(x) = eˣ/(1+eˣ) is the sigmoid function. Assuming all edges are independent of each other, the probability of an ego-subgraph G′_v being G+_v, i.e., P(G+_v = G′_v), can be obtained by

P(G+_v = G′_v) = Π_{(u,ς)∈E′_v} P((u, ς) ∈ E+_v) · Π_{(u,ς)∈E_v\E′_v} (1 - P((u, ς) ∈ E+_v)). (5)

Note that once G+_v has been obtained, we can further construct G-_v = Ḡ+_v. Due to the discrete nature of this sampling, we adopt the reparameterization trick (Jang et al., 2017; Maddison et al., 2017) to enable updating the parameters of the MLPs with a general gradient-based optimizer. See Appendix C for details.

Computational Complexity. The complexity of the edge representation e_{uς|G_v} is O(l·|V_v|·|E_v|) due to the computation of x̃_{u|G_v} (Yuster & Zwick, 2005), and the complexity of the probability estimation is O((1 + (h-1)/m)·|E_v|·d²), where we assume the hidden dimension in each MLP_i is d/m and h is the number of layers in MLP_i. As a comparison, the complexity of the GNN-based flow used in learnable view generation (Yin et al., 2022; Li et al., 2022) is O(k·|E_v|·C_e + k·|V_v|·C_v + h·|E_v|·d²), where C_e and C_v are the complexities of message passing and combination, respectively. In practice, the latter is likely to be greater than the former due to large C_e and C_v. See Appendix D for the real runtime evaluation.

Optimization objective. To satisfy the signature of G+_v in Definition 1, the NLL loss on ϕ+(G+_v; Θ+) can be minimized to encourage G+_v to be sufficient for predicting the correct label y_v:

L_suff(φ | ϕ̂+) = E_{G+_v ~ P(G+_v = G′_v)}[L_pred(ϕ̂+)], (6)

where ϕ̂+ = argmin_{ϕ+} L_pred(ϕ+). Under Assumption 1, the relationship between G+_v and G-_v can be established by a margin rank loss that enforces ϕ̂-(G-_v; Θ-) to incur a greater empirical risk:

L_rank(φ | ϕ̂+, ϕ̂-) = E_{G+_v ~ P(G+_v = G′_v)}[max(0, -L_pred(ϕ̂-) + L_pred(ϕ̂+) + δ_margin)], (7)

where ϕ̂- = argmin_{ϕ-} L_neg(ϕ-). Additionally, to enforce φ to find the most significant edges, G+_v should be as sparse as possible, which aligns with the findings of previous studies in causal discovery and counterfactual explanation (Zheng et al., 2018; Wachter et al., 2017). An L1 regularization L_size is applied to the probability of being selected into E+_v for all edges:

L_size(φ) = E_{∀(u,ς)∈E_v}[P((u, ς) ∈ E+_v)]. (8)

Note that the prior proportion of the causally-correlated part differs across domains.
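The distance-encoding vector r_{u|G_v} in Eq. (2) is just the ego's column of successive powers of the random-walk matrix AD⁻¹. A small NumPy sketch (assuming a dense adjacency matrix with no isolated nodes; not the paper's implementation):

```python
import numpy as np

def distance_encoding(A, ego, l):
    """Landing probabilities (A D^{-1})^t_{u, ego} for t = 1..l and every
    node u, i.e. the structural part of the node feature in Eq. (2)."""
    P = A / A.sum(axis=0, keepdims=True)   # column-normalized walk matrix A D^{-1}
    r = np.empty((len(A), l))
    Pt = np.eye(len(A))
    for t in range(l):
        Pt = P @ Pt                        # (A D^{-1})^{t+1}
        r[:, t] = Pt[:, ego]               # landing probability at node u from ego
    return r
```

On a triangle graph, for instance, one step from the ego lands on either neighbor with probability 1/2, and two steps return to the ego with probability 1/2.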
With a parameter α to control L_size(φ), the counterfactual loss for the counterfactual transformation module is formulated as:

L_cf(φ) = L_suff(φ | ϕ̂+) + L_rank(φ | ϕ̂+, ϕ̂-) + α · L_size(φ). (9)

The Difference from Counterfactual Explanation. To address the task rather than the model, L_suff maximizes the mutual information between G+_v and y_v, while counterfactual explanation works maximize the mutual information between G+_v and the estimated model's prediction ϕ̂(G_v; Θ). Counterfactual explanation works directly maximize the empirical risk L_pred(ϕ-), which is helpful for explaining an estimated model but can sort spurious structures into G+_v, since spurious structures are also correlated with the label. Therefore, we choose a rank loss to ensure consistency with Assumption 1.

4.3. CONTRASTIVE LEARNING WITH TASK-ORIENTED COUNTERFACTUAL VIEWS

Contrasting the raw ego-graphs with both the positive and the negative counterfactual views can inject task knowledge from φ into ϕ and lead to better model performance and generalizability. An InfoNCE (van den Oord et al., 2018) style loss is formulated as

L_cl(ϕ) = E_{(G_v, G+_v, G-_v) | ∀v∈V}[ -log( exp(sim(G_v, G+_v)/τ) / (exp(sim(G_v, G+_v)/τ) + exp(sim(G_v, G-_v)/τ)) ) ], (10)

where sim(G_v, G′_v) = 1 - JSD(ϕ(G_v), ϕ(G′_v)) and JSD is the Jensen-Shannon divergence. τ is the temperature parameter and controls the strength of the penalties on the task-oriented counterfactual negative views: the larger τ is, the smaller the influence of the similarity between the original ego-graphs and the task-oriented counterfactual negative views.

Scalability. Similar to BGRL (Thakoor et al., 2021), this contrastive loss does not need to contrast in-batch negative samples and has time and space complexities scaling linearly with the batch size. Thus, G-CENSOR avoids the performance-memory trade-off inherent in most prior GCL methods.
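A sketch of this contrastive loss for a single (G, G+, G-) triple, represented directly by the model's three output distributions (a simplification for illustration; the paper's loss is an expectation over all egos):

```python
import numpy as np

def jsd(p, q, eps=1e-12):
    """Jensen-Shannon divergence between two discrete distributions."""
    m = 0.5 * (p + q)
    kl = lambda a, b: np.sum(a * (np.log(a + eps) - np.log(b + eps)))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def contrastive_loss(p_raw, p_pos, p_neg, tau=0.1):
    """InfoNCE-style loss over one triple, with sim = 1 - JSD between
    the GNN's output distributions, as in Eq. (10)."""
    s_pos = np.exp((1.0 - jsd(p_raw, p_pos)) / tau)
    s_neg = np.exp((1.0 - jsd(p_raw, p_neg)) / tau)
    return float(-np.log(s_pos / (s_pos + s_neg)))
```

When the positive view yields the same distribution as the raw ego-graph and the negative view a distant one, the loss falls well below the chance level of log 2.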

4.4. JOINT LEARNING OF GNN AND CTM

For simplicity and efficiency, we empirically share ϕ with ϕ+ and ϕ-, and jointly learn ϕ and φ with a weight parameter β for the contrastive learning. The final joint loss is therefore formulated as:

L_joint(ϕ, φ) = L_pred(ϕ) + L_cf(φ | ϕ) + β · L_cl(ϕ). (11)

5. EXPERIMENTS

To answer Question 2, we conduct extensive experiments on eight real-world datasets from two perspectives: task performance and model generalizability.

5.1. EXPERIMENTAL SETUP

Datasets. Eight open benchmark datasets across three domains are investigated: (a) five citation networks (Bojchevski & Günnemann, 2018) including CoraFull, CoraML, CiteSeer, DBLP and PubMed; (b) two product networks (Shchur et al., 2018) including Computers and Photo; (c) an image network (Zeng et al., 2020), Flickr. See Appendix E.1 for more details on the datasets. Baselines. Three types of baselines are compared: (a) base models, i.e., SAGE, GAT and GIN; (b) GCL methods with uniform or adaptive data augmentation, i.e., MVGRL (Hassani & Ahmadi, 2020), GRACE (Zhu et al., 2020), GCA (Zhu et al., 2021) and BGRL (Thakoor et al., 2021); (c) GCL methods with learnable data augmentation, i.e., AUTOGCL (Yin et al., 2022) and RGCL (Li et al., 2022). For more details, refer to Appendix E.2. Implementation. For a fair comparison, the architecture of the base GNNs and the batch training setting were the same for all methods. All experiments were run 5 times with random seeds from 0 to 4. For more implementation details, refer to Appendix E.3.

5.2. IMPROVEMENT ON PERFORMANCE

As shown in Table 2, G-CENSOR significantly enhances the performance of the base GNNs (best in 23 out of 24 settings). Specifically, G-CENSOR improves the accuracy of the base GNNs by 0.4% (GraphSAGE on CiteSeer) to 20.67% (GraphSAGE on CiteSeer). Considering all settings, the average gain of G-CENSOR is around 4.6%. Moreover, G-CENSOR consistently outperforms the SOTA methods by 0.06% (GAT on Computers) to 14.59% (GIN on Flickr), except for the setting of GAT on Flickr, where G-CENSOR achieved the second-best performance. The average gain of G-CENSOR over SOTA is 1.87%. Note that on datasets like Photo, though the SOTA methods already achieve high performance (>93%), G-CENSOR still pushes the boundary forward (>94%). All these results demonstrate the ability of G-CENSOR to enhance model performance. Meanwhile, GCA, an approach that adopts adaptive data augmentation and thus enables models to learn important structures, outperforms the other baselines in most settings. This may imply the effectiveness of G-CENSOR's automatic selection of task-oriented positive structures. As for the learnable data augmentation methods, i.e., AUTOGCL and RGCL, they did not achieve satisfactory performance in most settings, probably because of the task shift from graph classification to node classification.

5.3. IMPROVEMENT ON GENERALIZABILITY

To verify the ability of G-CENSOR to boost the generalizability of GNNs, we further conduct experiments with an out-of-distribution data split based on the confounder discussed in Section 3, i.e., the ego. For each dataset, we first run NODE2VEC (Grover & Leskovec, 2016) to get node embeddings and cluster the nodes into two clusters with K-MEANS (Lloyd, 1982). The cluster with the larger sample size is randomly divided into training and validation sets, and the other is regarded as the testing set. As shown in Table 3, G-CENSOR still significantly enhanced the performance of the base GNNs by up to 18.26% with an average improvement of 6.02%, and consistently outperformed the SOTA methods by up to 7.40% with an average improvement of 1.62%. While a larger performance degradation suggests a larger distribution gap between the test set and the training/validation sets, we observe that base GNNs benefit more from G-CENSOR than the SOTA methods do. For example, CoraFull, Photo and Flickr are the three datasets with the most significant performance degradation, and G-CENSOR outperformed the best SOTA method the most on these three datasets. All these results demonstrate the superiority of G-CENSOR in enhancing the generalizability of various GNNs.
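The split procedure can be sketched as follows. This is a toy stand-in: a minimal 2-means on precomputed embeddings replaces the NODE2VEC + K-MEANS pipeline, and `train_ratio` is a hypothetical parameter, as the paper does not state the train/validation proportion.

```python
import numpy as np

def ood_split(emb, train_ratio=0.8, iters=20, seed=0):
    """Cluster nodes into two groups on their embeddings, split the larger
    cluster into train/val and keep the smaller one as the OOD test set."""
    # deterministic init for this sketch: the coordinate-wise min/max points
    centers = np.stack([emb.min(axis=0), emb.max(axis=0)])
    for _ in range(iters):
        d = ((emb[:, None, :] - centers[None]) ** 2).sum(axis=-1)
        assign = d.argmin(axis=1)               # nearest-center assignment
        for c in (0, 1):
            if (assign == c).any():
                centers[c] = emb[assign == c].mean(axis=0)
    big = int((assign == 1).sum() > (assign == 0).sum())
    big_idx = np.flatnonzero(assign == big)
    np.random.default_rng(seed).shuffle(big_idx)
    n_train = int(train_ratio * len(big_idx))
    return big_idx[:n_train], big_idx[n_train:], np.flatnonzero(assign != big)
```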

5.4. SENSITIVITY ANALYSIS

G-CENSOR's sensitivity w.r.t. the hyperparameters α, δ, τ and β (with GAT as the base model) is evaluated by presenting the median performances in Figure 5. We observe that: (a) α plays a crucial role in G-CENSOR. Different tasks prefer different α, e.g., the best α is around 0.1 on Cora but around 0.005 on Photo, and constantly increasing α can hurt the performance, e.g., the accuracy tends to go down after α exceeds 0.005 on Flickr. This is reasonable since α implies a prior proportion of the causally-correlated part in a particular network, as discussed for Equation 9. (b) G-CENSOR is relatively robust to β: it improves the base models' performance under various values of β. For the sensitivity analysis with GraphSAGE and GIN, refer to Appendix F.

6. CONCLUSION

This paper proposes a novel graph contrastive learning framework, G-CENSOR, which leverages task-oriented counterfactual view generation to enhance the performance and generalizability of GNNs on node property prediction tasks without any change to the model structure or the inference flow. Through extensive experiments with in-depth analysis, we demonstrate the superiority of G-CENSOR. However, the counterfactual data synthesis could be further improved with counterfactual inference, i.e., the three steps of abduction, action and prediction in a structural causal model (Pearl, 2009). We will explore this in future work.

A.2 CAUSAL DIAGRAMS FOR K-HOP EGO-GRAPH GENERATION PROCESS

Let us explain the k-hop ego-graph generation process as a hierarchical traversal allowing revisits to a node or an edge. The causal diagrams for k-hop ego-graphs are shown in Figure 6.

Figure 6a. In this case we assume that an edge is a causal edge w.r.t. a task only if it has at least one path to the ego whose edges are all causal w.r.t. the task. This ensures that the causal ego-subgraph is a connected graph. Now, if the ego-graph is acyclic, there are three types of edges: (a) edges whose path(s) to the ego include only causal edges (i.e., edges indicated by the arrows in the upper row), (b) edges whose path(s) to the ego include no causal edges (i.e., edges indicated by the arrows in the bottom row), and (c) edges whose path(s) to the ego include both causal and non-causal edges (i.e., edges formed by the slanted arrows). Edges of the first type can be causally correlated since they determine the label y, but edges of the second type can be spuriously correlated, since the association paths from them to the label y exhibit fork (confounding) patterns joined by v. As for edges of the last type, they and the label y can also be confounded by V_{i,1}, where i ∈ {0, 1, ..., k-1}. If the ego-graph has cycles, there exists at least one edge that can be included in multiple E_{i,j}, where j ∈ {0, 1}. Such an edge is causally correlated to the label y if and only if max(j) = 1, which indicates that the edge is part of the causes of the label y; otherwise it is spuriously correlated to the label y, since it can be confounded by v like the other edges in E_{i,0}.

Figure 6b. In this case, high-order edges actually have no causal relationship to the label y, and they can be confounded by either v or V_{1,1}.

B BASE MODELS

GraphSAGE (abbreviated as SAGE). Hamilton et al. (2017) proposed three variants; we utilize the simple but popular version, SAGE-mean, which aggregates neighbors by averaging their embeddings. In particular, the strategy of SAGE-mean is defined as follows:

h^k_v ← σ( W^k · Mean({h^{k-1}_v} ∪ {h^{k-1}_u, ∀u ∈ N(v)}) ),

where W^k is the linear transformation weight matrix in the k-th layer, h^k_v is the embedding of node v in the k-th layer and N(v) is the set of neighbors of node v.
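A NumPy sketch of one SAGE-mean layer under the equation above; σ is left as a pluggable activation (tanh by default, identity in the test below), since the equation does not fix it:

```python
import numpy as np

def sage_mean_layer(H, A, W, act=np.tanh):
    """One SAGE-mean layer: mean-pool each node's own embedding together
    with its neighbors', then apply the layer weight W^k and sigma."""
    out = np.empty((len(A), W.shape[0]))
    for v in range(len(A)):
        nbrs = np.flatnonzero(A[v])                      # N(v)
        pooled = np.vstack([H[[v]], H[nbrs]]).mean(axis=0)
        out[v] = act(W @ pooled)
    return out
```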

GAT.

A graph attention network employs a multi-head attention mechanism over the aggregation of neighbors' features, which enables assigning different weights to different neighbors (Veličković et al., 2018). The aggregation strategy is defined as:

h^k_v ← ∥_{m=1}^M σ1( Σ_{u∈{v}∪N(v)} α^{mk}_{vu} W^{mk} h^{k-1}_u ),

where σ1 is the ELU (Clevert et al., 2016) function, W^{mk} is the linear transformation weight matrix of the m-th head in the k-th layer, and

α^{mk}_{vu} = exp( σ2( a^{mk} [W^{mk} h^{k-1}_v ∥ W^{mk} h^{k-1}_u] ) ) / Σ_{i∈{v}∪N(v)} exp( σ2( a^{mk} [W^{mk} h^{k-1}_v ∥ W^{mk} h^{k-1}_i] ) ),

where σ2 is the LeakyReLU (Maas et al., 2013) function and a^{mk} is the attention weight vector of the m-th head in the k-th layer.

GIN. The Graph Isomorphism Network, abbreviated as GIN, captures graph structure differences by summing neighbors' features, which provably maps any two graphs that the Weisfeiler-Lehman isomorphism test (Leman & Weisfeiler, 1968) decides as non-isomorphic to different embeddings (Xu et al., 2019). The updating strategy is shown below:

h^k_v ← MLP^k( (1 + ϵ^k) h^{k-1}_v + Σ_{u∈N(v)} h^{k-1}_u ),

where MLP^k is a multi-layer perceptron (Hornik et al., 1989) for the k-th layer, and ϵ^k can be a learnable parameter or a fixed scalar in the k-th layer.
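The GIN update is compact enough to express in one line of NumPy; the MLP is any callable (identity by default here, purely for illustration):

```python
import numpy as np

def gin_layer(H, A, mlp=lambda x: x, eps=0.0):
    """One GIN update: (1 + eps) * h_v plus the sum of neighbor
    embeddings (A @ H), followed by the layer's MLP."""
    return mlp((1.0 + eps) * H + A @ H)
```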

C REPARAMETERIZATION TRICK

Following Maddison et al. (2017), denote (1/m) Σ_{i=1}^m MLP_i(e_{uς|G_v}; Λ_i) in Equation (4) by s_{uς|G_v}. In the training stage, the probability of the edge (u, ς) being part of E+_v is given by

θ^train_{uς|G_v} = S( (log(ϵ) - log(1-ϵ) + s_{uς|G_v}) / λ ),

where S(x) = 1/(1+e^{-x}) and ϵ ~ U(0, 1) is an independent random variable that obeys a standard uniform distribution. λ is the temperature parameter that controls the approximation. When λ → 0, θ^train_{uς|G_v} is binarized, with

lim_{λ→0} P(θ^train_{uς|G_v} = 1) = exp(s_{uς|G_v}) / (1 + exp(s_{uς|G_v})).
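A sketch of this sampler: with a small λ the relaxed samples concentrate near 0 and 1, and with s = 0 roughly half the edges end up on either side. The clipping is an implementation detail added here to avoid numerical overflow, not part of the formula.

```python
import numpy as np

def concrete_edge_sample(s, lam, rng):
    """Binary Concrete relaxation of Ber(S(s)): a differentiable surrogate
    for sampling whether an edge joins E^+_v (Maddison et al., 2017)."""
    eps = rng.uniform(1e-9, 1 - 1e-9, size=np.shape(s))
    logits = (np.log(eps) - np.log(1 - eps) + s) / lam
    return 1.0 / (1.0 + np.exp(-np.clip(logits, -60, 60)))
```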

D REAL RUNTIME EVALUATION

The real runtime of a model on a dataset is calculated by averaging the minimum one-epoch runtime of the three base GNNs over all settings on the dataset (as shown in Table 4). Note that all experiments were conducted on the same platform described in Appendix E.3. G-CENSOR actually achieves a very fast speed compared to the other baselines. This should be attributed to the two reasons mentioned before, i.e., the simple edge probability estimator and no need to contrast in-batch samples.

E EXPERIMENTAL SETUP

E.1 DATASETS

In the five citation networks, i.e., Cora Full, Cora ML, CiteSeer, DBLP and PubMed, nodes represent papers and edges represent citation links. Given paper text as bag-of-words node features, the task is to predict the topic of a paper. In the two product networks, i.e., Computers and Photo, nodes represent products and edges indicate that two goods are frequently bought together. Given product reviews as bag-of-words node features, the task is to map goods to their respective product categories. In the image network, i.e., Flickr, nodes represent images and edges indicate that two images share some common properties (e.g., the same geographic location, comments by the same user, etc.). Given bag-of-words representations of the images as node features, the task is to predict the type of an image.

E.2 BASELINES

GCL methods with uniform or adaptive data augmentation: 1. MVGRL (Hassani & Ahmadi, 2020): Multi-View Graph Representation Learning, an approach contrasting encodings from first-order neighbors and a general graph diffusion, and also contrasting node and graph encodings across views. 2. GRACE (Zhu et al., 2020): GRAph Contrastive rEpresentation learning, an approach generating two graph views by corruption and learning node representations by maximizing the agreement of node representations in these two views. 3. GCA (Zhu et al., 2021): Graph Contrastive representation learning with Adaptive augmentation, an approach designing augmentation schemes based on node centrality measures to highlight important connective structures. 4. BGRL (Thakoor et al., 2021): Bootstrapped Graph Latents, a graph representation learning method that learns by predicting alternative augmentations of the input. BGRL uses only simple augmentations, alleviates the need for contrasting with negative examples, and is thus scalable by design. GCL methods with learnable data augmentation: 1.
RGCL (Li et al., 2022) : Rationale-aware Graph Contrastive Learning, an unsupervised approach using a rationale generator to reveal salient structures about graph instancediscrimination as the rationale, and then creating rationale-aware views for contrastive learning. Note that this method, designed for graph property prediction tasks, integrates the views generation module and the inference flow of the predictor. Therefore, we regard node property prediction tasks as ego-graph property prediction tasks to adapt to this method. 2. AutoGCL (Yin et al., 2022) : Automated Graph Contrastive Learning, an approach employing a set of learnable graph view generators orchestrated by an auto augmentation strategy, where every graph view generator learns a probability distribution of graphs conditioned by the input. This method is proposed for graph property prediction tasks. However, it can be directly transferred to node property tasks since its views generator and task predictor are separates. Note that while our work can be easily enhanced by considering node feature transformation in views generation, we focus on structure transformation in this work, thus feature transformation is disabled in all models including G-CENSOR.

E.3 IMPLEMENTATION DETAILS

Experiments were conducted on an Ubuntu 18.04 server with one Nvidia Tesla V100-32G GPU. The code was implemented in Python 3.8 with PyG 2.0.4 and PyTorch 1.11 using CUDA 11.3. For all datasets, the number of sampled neighbors was set to 64. The batch size was set to 64 for all models, and an AdamW optimizer (Loshchilov & Hutter, 2019) with learning rate 0.01 was used to train all models. For a fair comparison, the number of layers of the base GNNs was set to 2 for all baselines, and all contrastive baselines were used as an auxiliary task (Xie et al., 2022). The weight of the contrastive loss was searched from 0.1 to 0.9. Moreover, for all baselines with edge drop probability and temperature hyperparameters, we searched edge drop probabilities over [0.1, 0.2, 0.3, 0.4] and temperatures over [0.1, 0.2], unless the original paper reported the best choices on the datasets. As for G-CENSOR, m in Equation 4 was simply set to 4, α in Equation 9 was searched from 1e-5 to 1e-1, δ_margin in Equation 7 was searched over [0.5, 0.1], τ in Equation 10 was searched over [0.05, 0.1, 0.15, 0.2], and β in Equation 11 was searched from 0.1 to 0.5.
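The G-CENSOR search space above can be sketched as a simple grid; note that the text gives only the endpoints for α and β, so the concrete grid points below (and the parameter names) are assumptions for illustration:

```python
import itertools

# Search space for G-CENSOR as described in the text. The alpha and
# beta grids are assumed step choices, since only ranges are reported.
search_space = {
    "contrastive_weight": [round(0.1 * i, 1) for i in range(1, 10)],  # 0.1 .. 0.9
    "alpha":              [1e-5, 1e-4, 1e-3, 1e-2, 1e-1],            # assumed log grid
    "delta_margin":       [0.5, 0.1],
    "tau":                [0.05, 0.1, 0.15, 0.2],
    "beta":               [round(0.1 * i, 1) for i in range(1, 6)],   # 0.1 .. 0.5
}

def grid(space):
    """Yield every hyperparameter combination as a dict."""
    keys = list(space)
    for values in itertools.product(*(space[k] for k in keys)):
        yield dict(zip(keys, values))

total = sum(1 for _ in grid(search_space))  # 9 * 5 * 2 * 4 * 5 combinations
```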

F SENSITIVITY ANALYSIS

This section presents the sensitivity analysis of the hyperparameters for GraphSAGE and GIN.



(a) Graph structure determines label. (b) Label of ego causes graph structure.

Figure 2: Causal diagrams (Pearl, 2009) for the 1-hop ego-graph generation process. A solid arrow from a to b represents a causal relationship from a to b. The two dashed lines labeled GNN indicate that the whole observed graph (G) is used to predict the task label (y). (a) The attributes of the ego (v) can cause two kinds of edges: E_0 (with nodes V_0) and E_1 (with nodes V_1). Only the structure of E_1 determines the label (y). (b) The label (y) is an inherited property of the ego (v), and this property causes E_1 (with nodes V_1). Meanwhile, other properties of v cause E_0. For interpretative examples, refer to Appendix A.1; for the causal diagrams of the k-hop (k > 1) ego-graph generation process, refer to Appendix A.2.

Figure 3: The overall architecture of G-CENSOR. The solid lines represent the normal flow of training a GNN ϕ on the raw ego-graphs by minimizing a general prediction loss (Section 4.1). The dashed lines represent the flow of training a counterfactual transformation module φ that generates task-oriented counterfactual positive/negative views by minimizing a designed counterfactual loss (Section 4.2). The dotted line represents conducting graph contrastive learning with the task-oriented counterfactual views (Section 4.3). We jointly optimize the three objectives to learn ϕ and φ (Section 4.4).

Figure 4: Counterfactual Transformation Module. (a) The solid blue square is the original node feature x, the dashed gray square is the landing probability of a random walk from the ego to a node, i.e., r, and the dotted pink square indicates whether a node is the ego, i.e., 1_{u=v}. (b) The estimation of the probability that an edge belongs to G+. (c) Sampling G+ according to the estimated probabilities of all the edges and constructing G- based on G+. Only the original node feature x is kept in both G+ and G-.
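The flow in Figure 4 can be illustrated with a minimal NumPy sketch. The linear scorer `w`, the concatenation of node descriptors, and the complement-based construction of G- are all illustrative assumptions, not the paper's exact estimator:

```python
import numpy as np

rng = np.random.default_rng(0)

def edge_probabilities(x, r, is_ego, edges, w):
    """Estimate, per edge (u, v), the probability it belongs to the
    counterfactual positive view G+. Each node is described by its raw
    feature x, the random-walk landing probability r from the ego, and
    an ego indicator 1_{u=v}; a placeholder linear scorer w maps the
    concatenated endpoint descriptors to a logit."""
    h = np.concatenate([x, r[:, None], is_ego[:, None]], axis=1)
    d = h.shape[1]
    logits = np.array([h[u] @ w[:d] + h[v] @ w[d:] for u, v in edges])
    return 1.0 / (1.0 + np.exp(-logits))  # sigmoid

def split_views(edges, probs, rng):
    """Sample G+ edge-wise from the estimated probabilities; here G- is
    assumed to keep the complementary edges (one plausible way to
    'construct G- based on G+')."""
    keep = rng.random(len(edges)) < probs
    g_pos = [e for e, k in zip(edges, keep) if k]
    g_neg = [e for e, k in zip(edges, keep) if not k]
    return g_pos, g_neg

# Tiny 1-hop ego-graph: node 0 is the ego, 4-dim node features.
x = rng.standard_normal((4, 4))
r = np.array([1.0, 0.3, 0.3, 0.4])      # landing probabilities
is_ego = np.array([1.0, 0.0, 0.0, 0.0])
edges = [(0, 1), (0, 2), (0, 3)]
w = rng.standard_normal(12)              # 2 * (4 + 1 + 1) weights

probs = edge_probabilities(x, r, is_ego, edges, w)
g_pos, g_neg = split_views(edges, probs, rng)
```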

Figure 5: Sensitivity analysis on hyperparameters with GAT as base model.

Figure 6: Causal diagrams (Pearl, 2009) for the k-hop (k > 1) ego-graph generation process. A solid arrow from a to b represents a causal relationship from a to b. The two dashed lines labeled GNN mean that we usually use the whole observed graph (G) to predict the task label y.

The comparison between G-CENSOR and the other state-of-the-art GCL models.

and a task label y_{v_i} ∈ {1, . . . , c} are assigned, where d is the feature dimension and c is the number of classes. A k-hop ego-graph (Daly &

Comparison in node classification accuracy on the independent and identically distributed test set. Bold and underline indicate the best and second-best performance, respectively.

Comparison in node classification accuracy on the out-of-distribution test set. Bold and underline indicate the best and second-best performance, respectively. ∆ represents the average accuracy degradation of the base GNNs on the test set compared to the accuracy on the validation set. ↑ represents the average gain of G-CENSOR against the best SOTA GCL methods.

Table 4: Real runtime (seconds per epoch) of all the methods on all the datasets.

Dataset statistics.

A.1 INTERPRETATIVE EXAMPLES FOR 1-HOP EGO-GRAPH

Example 1 (for Figure 2a). Assume this is a paper impact (y) prediction task defined on a citation graph. Generally, the type of v affects the graph structure composed of cited papers (E_1 with V_1), e.g., survey papers are more likely to be cited than general papers. Meanwhile, the type of v affects the graph structure composed of references (E_0 with V_0), e.g., survey papers have more references than general papers. While being cited more (E_1) stably indicates higher impact (y), a GNN using all observed structures (G) is prone to learning that papers with more references (E_0) tend to have higher impact (y), and this relationship can be considered a spurious correlation.

Example 2 (for Figure 2b). Assume this is an image category (y) prediction task, e.g., landscape vs. city, defined on an image graph. In this graph, an edge exists if two images share some common properties, e.g., location, producer, and object (McAuley & Leskovec, 2012). E_1 (with V_1) can be images with the same objects (related to the category), and E_0 (with V_0) can be images with the same producers or locations. While the category (y) stably indicates specific objects, thus generating connections between y and E_1, a GNN using all observed structures (G) may learn a joint probability distribution over the category (y) and images sharing the same producers or locations (E_0). This joint probability distribution can be considered a spurious correlation.

Example 3 (for Assumption 1). In Example 1, all highly-cited papers are necessarily considered high-impact papers, but there can exist a high-impact paper without many references. Similarly, in Example 2, an image must have connections to the images that belong to the same category, but a landscape image can be produced by a photographer who often takes pictures of human models.

