ADVERSARIAL CAUSAL AUGMENTATION FOR GRAPH COVARIATE SHIFT

Anonymous

Abstract

Out-of-distribution (OOD) generalization on graphs is drawing widespread attention. However, existing efforts mainly focus on one type of OOD issue, correlation shift, while another type, covariate shift, remains largely unexplored and is the focus of this work. From a data generation view, causal features are stable substructures in data that play key roles in OOD generalization, whereas their complementary parts, the environments, are unstable features that often lead to various distribution shifts. Correlation shift establishes spurious statistical correlations between environments and labels; in contrast, covariate shift means that there exist unseen environmental features in the test data. Existing strategies of graph invariant learning and data augmentation suffer from limited environments or unstable causal features, which greatly limits their generalization ability under covariate shift. In view of that, we propose a novel graph augmentation strategy, Adversarial Causal Augmentation (AdvCA), to alleviate covariate shift. Specifically, it adversarially augments the data to explore diverse distributions of the environments, while keeping the causal features stable across these diverse environments. By maintaining environmental diversity while ensuring the invariance of the causal features, it effectively alleviates covariate shift. Extensive experimental results with in-depth analyses demonstrate that AdvCA outperforms 14 baselines on synthetic and real-world datasets with various covariate shifts.



1. INTRODUCTION

Graph learning mostly follows the assumption that training and test data are independently drawn from an identical distribution. Such an assumption is difficult to satisfy in the wild due to out-of-distribution (OOD) issues (Shen et al., 2021), where the training and test data come from different distributions. Hence, OOD generalization on graphs is attracting widespread attention (Li et al., 2022b). However, existing studies mostly focus on correlation shift, which is just one type of OOD issue (Ye et al., 2022; Wiles et al., 2022), while another type, covariate shift, remains largely unexplored and is the focus of our work. Covariate shift is in stark contrast to correlation shift w.r.t. the causal and environmental features of data. Specifically, from a data generation view, causal features are the substructures of the entire graphs that truly reflect the predictive property of data, while their complementary parts are the environmental features that are noncausal to the predictions. Following prior studies (Arjovsky et al., 2019; Wu et al., 2022b), we assume causal features are stable across distributions, in contrast to the environmental features. Correlation shift denotes that environments and labels establish inconsistent statistical correlations in training and test data; whereas covariate shift means that the environmental features in test data are unseen in training data (Ye et al., 2022; Wiles et al., 2022; Gui et al., 2022). For example, in Figure 1, the environmental features ladder and tree differ between training and test data, which forms the covariate shift. Taking molecular property prediction as another example, functional groups (e.g., nitrogen dioxide (NO2)) are causal features that determine the predictive property of molecules, while scaffolds (e.g., carbon rings) are irrelevant patterns (Wu et al., 2018), which can be seen as the environments.
In practice, we often need to use molecular graphs collected in the past to train models, hoping that the models can predict the properties of molecules with new scaffolds in the future (Hu et al., 2020). Because of the differences between correlation and covariate shifts, we take a close look at the existing efforts on graph generalization. Existing efforts (Li et al., 2022b) mainly fall into the following research lines, each of which has inherent limitations in solving covariate shift.
• Invariant graph learning (Wu et al., 2022b; Liu et al., 2022; Sui et al., 2022) is gradually becoming a prevalent paradigm for OOD generalization. The main idea is to capture the causal features by minimizing the empirical risks within different environments. Unfortunately, it implicitly makes the prior assumption that all test environments are available during training. This assumption is unrealistic, since training data can hardly cover all possible test environments. Learning in limited environments can only alleviate the spurious correlations hidden in the training data, but fails to extrapolate to test distributions with unseen environments.
• Graph data augmentation (Ding et al., 2022; Zhao et al., 2022) perturbs graph features to enrich the distribution seen during training for better generalization. It can be roughly divided into node-level (Kong et al., 2022), edge-level (Rong et al., 2020), and graph-level (Wang et al., 2021; Han et al., 2022) augmentation, with random (You et al., 2020) or adversarial strategies (Suresh et al., 2021). However, these methods are prone to destroy the causal features, so they easily lose control of the perturbed distributions. For example, in Figure 1, the random strategy of DropEdge (Rong et al., 2020) inevitably perturbs the causal features (highlighted by red circles). As such, it fails to alleviate the covariate shift, even degrading the generalization ability.
Scrutinizing the limitations of the aforementioned studies, insufficient environments and unstable causal features largely hinder the ability of these generalization efforts against covariate shift. Hence, we naturally ask: "Can the augmented samples simultaneously preserve the diversity of environmental features and the invariance of causal features?" Towards this end, we first propose two principles for graph augmentation: environmental diversity and causal invariance. Specifically, environmental diversity encourages the augmentation to extrapolate to unseen environments; meanwhile, causal invariance shortens the distribution gap between the augmented data and test data. To achieve these principles, we design a novel graph augmentation strategy: Adversarial Causal Augmentation (AdvCA). Specifically, we augment the graphs by a network, named the adversarial augmenter. It adversarially generates masks on edges and node features, performing OOD exploration to improve environmental diversity. To maintain the stability of the causal features, we adopt another network, named the causal generator. It generates masks that capture causal features. Finally, we delicately combine these masks and apply them to the graph data. As shown in Figure 1, AdvCA only perturbs the environmental features, while keeping the causal parts untouched. Our quantitative experiments also verify that AdvCA can narrow the distribution gap between the augmented data and test data, as illustrated in Figure 1, thereby effectively overcoming the covariate shift issue. Our contributions can be summarized as:
• Problem: We explore one specific type of OOD issue in graph learning, covariate shift, which is of great need but largely unexplored.

2. PRELIMINARIES

In this section, we first give the formal definitions of causal features, environmental features, and graph covariate shift. Then we present the problem of graph classification under covariate shift.

2.1. NOTATIONS

We define uppercase letters (e.g., G) as random variables. Lowercase letters (e.g., g) are samples of variables, and blackboard-bold typefaces (e.g., G) denote the sample spaces. Let g = (A, X) ∈ G denote a graph, where A and X are its adjacency matrix and node features, respectively. Each graph is assigned a label y ∈ Y by a fixed labeling rule G → Y. Let D = {(g_i, y_i)} denote a dataset that is divided into a training set D_tr = {(g_i^e, y_i^e)}_{e∈E_tr} and a test set D_te = {(g_i^e, y_i^e)}_{e∈E_te}, where E_tr and E_te are the index sets of training and test environments, respectively. In this work, we focus on the graph classification scenario, which aims to train models on D_tr and infer the labels in D_te.

2.2. DEFINITIONS AND PROBLEM FORMATIONS

Following prior studies (Arjovsky et al., 2019; Wu et al., 2022b), we assume that the inner mechanism of the labeling rule G → Y depends on the causal features, which are particular subparts of the entire data. Causal invariance denotes that the relationship between the causal feature and the label is invariant across different environments or distributions, which makes OOD generalization possible (Ye et al., 2022). The complement of the causal parts, the environmental features, are noncausal for predicting the graphs. We now give a formal definition of these features.

Assumption 1 (Causal & Environmental Feature) Assume an input graph G contains two features G_cau and G_env satisfying G_cau ∪ G_env = G. If they obey the following conditions: i) (sufficiency condition) P(Y | G_cau) = P(Y | G); ii) (independence condition) Y ⊥ G_env | G_cau, then we define G_cau and G_env as the causal feature and environmental feature, respectively.

The sufficiency condition requires that the causal features be sufficient to preserve the critical information of the data G related to the label Y, while the independence condition indicates that the causal feature can shield the label from the influence of the environmental feature. It makes causal features establish an invariant relationship with labels across different environments. Hence, distribution shifts are only caused by the environmental features rather than the causal features. Recent studies (Ye et al., 2022; Gui et al., 2022) have pointed out that the OOD issue can be divided into correlation shift and covariate shift. Since we mainly focus on the latter, we put detailed discussions of their differences in Appendix C. We now give a formal definition of covariate shift on graphs.

Definition 1 (Graph Covariate Shift) Let P_tr and P_te denote the probability functions of the training and test distributions.
We measure the covariate shift between distributions P_tr and P_te as

GCS(P_tr, P_te) = (1/2) ∫_S |P_tr(g) − P_te(g)| dg,

where S = {g ∈ G | P_tr(g) ⋅ P_te(g) = 0}, which covers the features (e.g., environmental features) that do not overlap between the two distributions. GCS(P_tr, P_te) is always bounded in [0, 1]. The issue of graph covariate shift is very common in practice. For example, the chemical properties of molecules are mainly determined by specific functional groups, which can be regarded as causal features for predicting these properties (Arjovsky et al., 2019; Wu et al., 2022b), while their scaffold structures (Wu et al., 2018), which are often irrelevant to the properties, can be seen as environmental features. In practice, we often need to train models on past molecular graphs and hope that the model can predict the properties of future molecules with novel scaffolds (Hu et al., 2020). Hence, this work focuses on the covariate shift issue, which we formalize as follows.

Problem 1 (Graph Classification under Covariate Shift) Given training and test sets with environment sets E_tr and E_te, following distributions P_tr and P_te with GCS(P_tr, P_te) > 0, we aim to use the data collected from the training environments E_tr to learn a powerful graph classifier f*: G → Y that performs well in all possible test environments E_te:

f* = arg min_f sup_{e∈E_te} E_e[ℓ(f(g), y)],

where E_e[ℓ(f(g), y)] is the empirical risk in environment e, and ℓ(⋅, ⋅) is the loss function. Problem 1 states that it is unrealistic for the training set to cover all possible environments in the test set. It means that we have to extrapolate to unknown environments using the limited training environments at hand, which makes this problem challenging.
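As a concrete illustration of the measure (the paper's actual estimation procedure is described in Appendix B), the following sketch computes GCS for two distributions represented as probability vectors over a shared finite support. The discretized support and the function name are assumptions for illustration only.

```python
import numpy as np

def graph_covariate_shift(p_tr, p_te):
    """Estimate GCS(P_tr, P_te) = 1/2 * sum over the non-overlap
    region S = {g : P_tr(g) * P_te(g) = 0} of |P_tr(g) - P_te(g)|.

    p_tr, p_te: probability vectors over the same discretized
    support (an assumption for illustration)."""
    p_tr = np.asarray(p_tr, dtype=float)
    p_te = np.asarray(p_te, dtype=float)
    non_overlap = (p_tr * p_te) == 0          # the region S
    return 0.5 * np.abs(p_tr - p_te)[non_overlap].sum()

# Identical supports fully overlap: the shift is 0.
print(graph_covariate_shift([0.5, 0.5, 0.0], [0.5, 0.5, 0.0]))  # 0.0
# Disjoint supports: the shift is 1, the maximum.
print(graph_covariate_shift([0.5, 0.5, 0.0], [0.0, 0.0, 1.0]))  # 1.0
```

Note how the measure only accumulates mass on the non-overlapping region, matching the definition of S above.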

3. METHODOLOGY

In this section, we first propose two principles for graph data augmentation. Guided by these principles, we design a new graph augmentation strategy that can effectively solve Problem 1.

3.1. TWO PRINCIPLES FOR GRAPH AUGMENTATION

Scrutinizing Problem 1, we observe that covariate shift is mainly caused by the scarcity of training environments. Existing efforts (Liu et al., 2022; Sui et al., 2022; Wu et al., 2022b) intervene on or replace the environments to capture causal features. However, these environmental features still stem from the training distribution, which may result in limited environmental diversity. Worse still, if the environments are too scarce, the model will inevitably learn shortcuts between these environmental features, resulting in suboptimal learning of the causal parts. To this end, we propose the first principle for data augmentation:

Principle 1 (Environmental Diversity) Given a set of graphs {g} with distribution function P, let T(⋅) denote an augmentation function such that the augmented graphs {T(g)} follow distribution function P̃. Then T(⋅) should satisfy GCS(P, P̃) → 1.

Principle 1 states that P̃ should keep away from the original distribution P. Hence, the distribution of the augmented data tends not to overlap with the original distribution, which encourages the diversity of environmental features. However, from a data generation perspective, causal features are stable and shared across environments (Kaddour et al., 2022), so they are essential for OOD generalization. Since Principle 1 does not impose any constraint on the invariant property of the augmented distribution, we propose a second principle for augmentation:

Principle 2 (Causal Invariance) Given a set of graphs {g} with a corresponding causal feature set {g_cau = (A_cau, X_cau)}, let T(⋅) denote an augmentation function whose augmented graphs {T(g)} have the corresponding causal feature set {g̃_cau = (Ã_cau, X̃_cau)}. Then T(⋅) should satisfy E[∥A_cau − Ã_cau∥²_F] → 0 and E[∥X_cau − X̃_cau∥²_F] → 0, where ∥⋅∥_F is the Frobenius norm.

Principle 2 emphasizes the invariance of the graph structures and node features of the causal parts after data augmentation.
As illustrated in Figure 1, Principle 1 keeps the distribution of augmented data away from the training distribution; meanwhile, Principle 2 restricts the distribution of augmented data from drifting too far from the test distribution. These two principles complement each other and cooperate to alleviate the covariate shift.
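To make Principle 2 concrete, the following sketch (with an assumed function name and toy matrices, not the paper's implementation) computes the two Frobenius gaps for a single graph's causal adjacency matrix and causal node features:

```python
import numpy as np

def causal_invariance_gap(A_cau, A_cau_aug, X_cau, X_cau_aug):
    """Principle 2 asks that the expected squared Frobenius gaps
    E[||A_cau - A~_cau||_F^2] and E[||X_cau - X~_cau||_F^2] go to 0.
    This sketch computes the two gaps for one graph."""
    gap_a = np.linalg.norm(A_cau - A_cau_aug, ord="fro") ** 2
    gap_x = np.linalg.norm(X_cau - X_cau_aug, ord="fro") ** 2
    return gap_a, gap_x

# Toy causal adjacency and node features (assumptions for illustration).
A = np.array([[0., 1.], [1., 0.]])
X = np.array([[1., 0.], [0., 1.]])
# An augmentation that leaves the causal part untouched has zero gap.
print(causal_invariance_gap(A, A.copy(), X, X.copy()))  # (0.0, 0.0)
```

Any augmentation that edits causal edges or causal node features makes these gaps strictly positive, violating the principle.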

3.2. OUT-OF-DISTRIBUTION EXPLORATION

Given a GNN model f(⋅) with parameters θ, we decompose f = Φ ∘ h, where h(⋅): G → R^d is a graph encoder yielding d-dimensional representations and Φ(⋅): R^d → Y is a classifier. To comply with Principle 1, we need to perform OOD exploration. Inspired by distributionally robust optimization (Sagawa et al., 2020), we consider the following optimization objective:

min_θ sup_{P̃} {E_{P̃}[ℓ(f(g), y)] : D(P̃, P) ≤ ρ},  (3)

where P and P̃ denote the original and explored data distributions, respectively, and D(⋅, ⋅) is a distance metric between two probability distributions. The solution to Equation 3 guarantees generalization within a distance ρ of the distribution P. To better measure the distance between distributions, as suggested by Sinha et al. (2018), we adopt the Wasserstein distance (Arjovsky et al., 2017; Volpi et al., 2018) as the metric:

D(P̃, P) := inf_{μ∈Γ(P̃,P)} E_μ[c(g̃, g)],

where Γ(P̃, P) is the set of all couplings of P̃ and P, and c(⋅, ⋅) is the cost function. Studies (Dosovitskiy & Brox, 2016; Volpi et al., 2018) also suggest that distances in representation space typically correspond to semantic distances. Hence, we define the cost function in the representation space and adopt the following transportation cost:

c(g̃, g) = ∥h(g̃) − h(g)∥²₂,

which denotes the "cost" of augmenting the graph g to g̃.

Figure 2: The overview of the Adversarial Causal Augmentation (AdvCA) framework.

Since it is difficult to set a proper ρ in Equation 3, we instead consider the Lagrangian relaxation with a fixed penalty coefficient γ. Inspired by Sinha et al.
(2018), we can reformulate Equation 3 as follows:

min_θ sup_{P̃} {E_{P̃}[ℓ(f(g), y)] − γ D(P̃, P)} = min_θ E_P[φ(f(g), y)],

where φ(f(g), y) := sup_{g̃∈G} {ℓ(f(g̃), y) − γ c(g̃, g)} is defined as the robust surrogate loss. Conducting gradient descent on the robust surrogate loss yields

∇_θ φ(f(g), y) = ∇_θ ℓ(f(g*), y), where g* = arg max_{g̃∈G} {ℓ(f(g̃), y) − γ c(g̃, g)}.  (8)

Here g* is an augmented view of the original data g. Hence, to achieve OOD exploration, we just need to perform data augmentation via Equation 8 on the original data g.
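The inner maximization in Equation 8 can be approximated by gradient ascent. The toy sketch below replaces the graph and GNN with a plain feature vector, a linear model, and an identity encoder (all simplifying assumptions, not the paper's setting) purely to illustrate how the γ-penalized ascent yields an augmented view with higher loss that stays near the original input.

```python
import numpy as np

def adversarial_view(x, y, w, gamma=5.0, lr=0.1, steps=20):
    """Toy sketch of Equation 8: gradient ascent on
    loss(x~) - gamma * c(x~, x) to find the augmented view x*.
    Here f(x) = w.x with squared-error loss and the encoder h is
    the identity, so c(x~, x) = ||x~ - x||^2."""
    x0 = x.copy()
    for _ in range(steps):
        # d/dx [ (w.x - y)^2 - gamma * ||x - x0||^2 ]
        grad = 2.0 * (w @ x - y) * w - 2.0 * gamma * (x - x0)
        x = x + lr * grad          # ascent step on the inner objective
    return x

w = np.array([1.0, -1.0])
x = np.array([0.5, 0.5])
x_star = adversarial_view(x, y=1.0, w=w)
loss = lambda v: (w @ v - 1.0) ** 2
# The augmented view attains a higher loss than the original input,
# while the gamma penalty keeps it close to the original.
print(loss(x_star) > loss(x))  # True
```

A larger γ pulls x* back toward x (smaller exploration radius), mirroring the role of ρ in Equation 3.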

3.3. ADVERSARIAL CAUSAL AUGMENTATION

Equation 8 endows data augmentation with the ability of OOD exploration, which makes the augmented data meet Principle 1. In addition, to achieve Principle 2, we also need causal feature learning based on the sufficiency and independence conditions in Assumption 1. Hence, we design a novel graph augmentation strategy: Adversarial Causal Augmentation (AdvCA). The overview of the proposed framework is depicted in Figure 2. It mainly consists of two components: an adversarial augmenter and a causal generator. The adversarial augmenter achieves OOD exploration through adversarial data augmentation, which encourages the diversity of environmental features; meanwhile, the causal generator keeps causal features invariant by identifying them in the data. Below we elaborate on the implementation details.

Adversarial Augmenter & Causal Generator. We design two networks, the adversarial augmenter T_θ1(⋅) and the causal generator T_θ2(⋅), which generate masks for the nodes and edges of graphs. They have the same structure and are parameterized by θ_1 and θ_2, respectively. Given an input graph g = (A, X) with n nodes, the mask generation network first obtains node representations via a GNN encoder h(⋅). To judge the importance of nodes and edges, it adopts two MLP networks, MLP_1(⋅) and MLP_2(⋅), to generate the soft node mask matrix M^x ∈ R^{n×1} and edge mask matrix M^a ∈ R^{n×n}, respectively. In summary, the mask generation network can be decomposed as:

Z = h(g), M^x_i = σ(MLP_1(z_i)), M^a_ij = σ(MLP_2([z_i, z_j])),

where Z ∈ R^{n×d} is the node representation matrix, whose i-th row z_i = Z[i, :] denotes the representation of node i, and σ(⋅) is the sigmoid function that maps the mask values M^x_i and M^a_ij to [0, 1].

Adversarial Causal Augmentation. To estimate g* in Equation 8, we define the adversarial learning objective as:

max_θ1 L_adv = E_{P_tr}[ℓ(f(T_θ1(g)), y) − γ c(T_θ1(g), g)].  (10)
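A minimal sketch of the mask generation step, with single linear layers standing in for MLP_1 and MLP_2 and random weights (assumptions for illustration; the paper obtains Z from a GNN encoder h(⋅)):

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def generate_masks(Z, W_node, W_edge):
    """Sketch of the mask generation network: node mask
    M^x_i = sigmoid(MLP1(z_i)) and edge mask
    M^a_ij = sigmoid(MLP2([z_i, z_j])). Single linear layers
    stand in for the two MLPs here."""
    n, _ = Z.shape
    Mx = sigmoid(Z @ W_node)                       # (n, 1) node mask
    # Build all [z_i, z_j] pairs, then score and reshape to (n, n).
    pairs = np.concatenate(
        [np.repeat(Z, n, axis=0), np.tile(Z, (n, 1))], axis=1)
    Ma = sigmoid(pairs @ W_edge).reshape(n, n)     # (n, n) edge mask
    return Mx, Ma

rng = np.random.default_rng(0)
Z = rng.normal(size=(4, 8))          # node representations from h(g)
Mx, Ma = generate_masks(Z, rng.normal(size=(8, 1)),
                        rng.normal(size=(16, 1)))
print(Mx.shape, Ma.shape)  # (4, 1) (4, 4)
```

The sigmoid guarantees that every mask value lies in (0, 1), matching the soft-mask formulation above.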
Then we can augment the graph by T_θ1(g) = (A ⊙ M^a_adv, X ⊙ M^x_adv), where ⊙ is the broadcasted element-wise product. Although adversarially augmented graphs guarantee environmental diversity, the perturbation inevitably destroys the causal parts. Therefore, we utilize the causal generator T_θ2(⋅) to capture causal features and combine them with the diverse environmental features. Following the sufficiency and independence conditions in Assumption 1, we define the causal learning objective as:

min_{θ,θ2} L_cau = E_{P_tr}[ℓ(f(T_θ2(g)), y) + ℓ(f(g̃), y)],  (11)

where g̃ = (A ⊙ M̃^a, X ⊙ M̃^x) is the augmented graph. It adopts the mask combination strategy:

M̃^a = (1_a − M^a_cau) ⊙ M^a_adv + M^a_cau,  M̃^x = (1_x − M^x_cau) ⊙ M^x_adv + M^x_cau,

where M^a_cau and M^x_cau are generated by T_θ2(⋅), 1_a and 1_x are all-one matrices, and if there is no edge between node i and node j, we set the (i, j) entry of 1_a to 0. We now explain this combination strategy. Taking M̃^x as an example: since M^x_cau denotes the causal regions captured by T_θ2(⋅), 1_x − M^x_cau represents the complementary, environmental regions. M^x_adv represents the adversarial perturbation, so (1_x − M^x_cau) ⊙ M^x_adv is equivalent to applying the adversarial perturbation to the environmental features while sheltering the causal features. Finally, the additive term M^x_cau indicates that the augmented data should retain the original causal features. Hence, this combination strategy achieves both environmental diversity and causal invariance. Inspecting Equation 11, the first term indicates that causal features alone suffice for prediction, thus satisfying the sufficiency condition, while the second term encourages causal features to yield correct predictions after the environments are perturbed, thereby satisfying the independence condition.

Regularization. For Equation 10, the adversarial optimization tends to remove more nodes and edges, so we should also constrain the perturbations.
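The mask combination strategy can be sketched in a few lines; the toy mask values below are assumptions for illustration:

```python
import numpy as np

def combine_masks(M_adv, M_cau, ones):
    """Mask combination from Section 3.3:
    M~ = (1 - M_cau) * M_adv + M_cau.
    Entries where M_cau = 1 (causal) are kept intact, while the
    adversarial mask M_adv only touches the remaining (environmental)
    entries. For edge masks, `ones` has zeros where no edge exists."""
    return (ones - M_cau) * M_adv + M_cau

# Toy 3-node example with a hard causal mask for readability.
M_cau = np.array([1.0, 1.0, 0.0])   # first two nodes are causal
M_adv = np.array([0.2, 0.9, 0.3])   # adversarial node mask
ones = np.ones(3)
print(combine_masks(M_adv, M_cau, ones))  # causal entries stay 1.0
```

With soft masks the same formula interpolates: the closer a mask entry of M_cau is to 1, the less the adversarial perturbation can affect it.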
Although Equation 11 satisfies the sufficiency and independence conditions, it is necessary to impose constraints on the ratio of the causal features to prevent trivial solutions. Hence, we first define the regularization function

r(M, k, λ) = (∑_{ij} M_ij / k − λ) + (∑_{ij} I[M_ij > 0] / k − λ),

where k is the total number of elements to be constrained and I[⋅] ∈ {0, 1} is the indicator function. The first term penalizes deviations of the average mask value from λ, while the second term encourages an uneven distribution. Given a graph with n nodes and m edges, we define the regularization terms for adversarial augmentation and causal learning as:

L_reg1 = E_{P_tr}[r(M^x_adv, n, λ_a) + r(M^a_adv, m, λ_a)],  (12)
L_reg2 = E_{P_tr}[r(M^x_cau, n, λ_c) + r(M^a_cau, m, λ_c)],  (13)

where λ_c ∈ (0, 1) is the ratio of causal features. We set λ_a = 1 for adversarial learning, which alleviates excessive perturbation. The detailed algorithm of AdvCA is provided in Appendix A.1.
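The regularization function r(M, k, λ) can be sketched directly from the formula above (the function name is ours, and the signs follow the formula as written in the paper):

```python
import numpy as np

def mask_ratio_reg(M, k, lam):
    """Sketch of r(M, k, lambda) from Section 3.3: the first term
    compares the average mask value to the target ratio lambda,
    and the second compares the fraction of nonzero entries to
    lambda."""
    avg_term = M.sum() / k - lam
    nonzero_term = (M > 0).sum() / k - lam
    return avg_term + nonzero_term

# Toy 2x2 mask; values are assumptions for illustration.
M = np.array([[0.5, 0.0], [1.0, 0.5]])
print(mask_ratio_reg(M, k=4, lam=0.5))  # 0.25
```

Here the average mask value exactly matches λ = 0.5 (first term is 0), while 3 of the 4 entries are nonzero (second term contributes 3/4 − 0.5 = 0.25).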

4. EXPERIMENTS

In this section, we conduct extensive experiments to answer the following research questions:
• RQ1: Compared to existing efforts, how does AdvCA perform under covariate shift?
• RQ2: Can the proposed AdvCA achieve the principles of environmental diversity and causal invariance, thereby effectively alleviating covariate shift?
• RQ3: How do the different components of AdvCA affect performance?

4.1. EXPERIMENTAL SETTINGS

Datasets. We use the graph OOD datasets (Gui et al., 2022) and OGB datasets (Hu et al., 2020), which include Motif, CMNIST, Molbbbp and Molhiv. Following Gui et al. (2022), we adopt the base, color, size and scaffold data splittings to create various covariate shifts. The details of the datasets, metrics and implementation of AdvCA are provided in Appendix A.2 and A.3.
Baselines. We adopt 14 baselines, which fall into three categories:
• Generalization Algorithms: Empirical Risk Minimization (ERM), IRM (Arjovsky et al., 2019), GroupDRO (Sagawa et al., 2020), VREx (Krueger et al., 2021).
• Graph Invariant Learning and Generalization: DIR (Wu et al., 2022b), CAL (Sui et al., 2022), GSAT (Miao et al., 2022), OOD-GNN (Li et al., 2022a), StableGNN (Fan et al., 2021).
• Graph Data Augmentation: DropEdge (Rong et al., 2020), GREA (Liu et al., 2022), FLAG (Kong et al., 2022), M-Mixup (Wang et al., 2021), G-Mixup (Han et al., 2022).

4.2. MAIN RESULTS (RQ1)

To demonstrate the superiority of AdvCA, we first make comprehensive comparisons with the baseline methods. The implementation settings and details of the baselines are provided in Appendix A.4. All experimental results are summarized in Table 1. We have the following observations.
Obs1: Most generalization and augmentation methods fail under covariate shift. Generalization and data augmentation algorithms perform well on certain datasets or shifts. VREx achieves a 2.81% improvement on Motif (base). For the two shifts of Molhiv, the data augmentation methods GREA and DropEdge obtain 1.20% and 0.77% improvements. The invariant learning methods DIR and CAL also obtain 4.60% and 1.53% improvements on CMNIST and Molbbbp (size). Unfortunately, none of the methods consistently outperform ERM. For example, GREA and DropEdge perform poorly on Motif (base), dropping by 11.92% and 23.58%, respectively. DIR and CAL also fail on Molhiv. These results show that both invariant learning and data augmentation methods have their own weaknesses, which lead to unstable performance when facing the complex and diverse covariate shifts of different datasets.
Obs2: AdvCA consistently outperforms all baseline methods. Compared with ERM, AdvCA obtains significant improvements. For the two types of covariate shift on Motif, AdvCA surpasses ERM by 4.98% and 4.11%, respectively. In contrast to the large performance variances of the baselines across datasets, AdvCA consistently obtains leading performance across the board. For CMNIST, AdvCA achieves a performance improvement of 3.17% over the best baseline, DIR. For Motif, the performance is improved by 2.17% and 1.72% over VREx and GREA. These results illustrate that AdvCA can overcome the shortcomings of both invariant learning and data augmentation. Armed with the principles of environmental diversity and causal invariance, AdvCA achieves stable and consistent improvements on different datasets with various covariate shifts.
In addition, although we focus on covariate shift in this work, we also carefully check the performance of AdvCA under correlation shift, and the results are presented in Appendix D.1.

4.3. COVARIATE SHIFT AND VISUALIZATIONS (RQ2)

In this section, we conduct quantitative experiments to demonstrate that AdvCA can shorten the distribution gap, as shown in Figure 1. Specifically, we utilize GCS(⋅, ⋅) as the measurement to quantify the degree of covariate shift; the detailed estimation procedure is provided in Appendix B. To make comprehensive comparisons, we select four different types of covariate shift (base, size, color and scaffold) for the experiments. We choose three data augmentation baselines, DropEdge, FLAG and G-Mixup, which augment graphs from different views. The experimental results are shown in Table 2. We calculate the covariate shifts between the augmentation distribution P_aug and the training distribution P_tr or test distribution P_te: "Aug-Train" and "Aug-Test" denote GCS(P_aug, P_tr) and GCS(P_aug, P_te), respectively. From the results in Table 2, we have the following observations.
To verify the environmental diversity and causal invariance of AdvCA, we plot the augmented graphs in Figure 3. These augmented graphs are randomly sampled during training; more visualizations are depicted in Appendix D.3. The Motif and CMNIST graphs are displayed in the first and second rows, respectively. Figure 3 (Left) shows the original graphs. For Motif, the green part represents the motif-graph, whose type determines the label, while the yellow part denotes the base-graph that contains environmental features. For CMNIST, the red subgraph contains causal features, while the complementary parts contain environmental features. Figure 3 (Right) displays the augmented samples during training. Nodes with darker colors and edges with wider lines indicate higher soft-mask values. From these visualizations, we have the following observations.
Obs4: AdvCA achieves both environmental diversity and causal invariance. We observe that AdvCA only perturbs the environmental features while keeping the causal parts invariant. For the Motif dataset, the base-graph is a ladder and the motif-graph is a house.
After augmentation, the nodes and edges of the ladder graph are perturbed. In contrast, the house part remains invariant and stable during training. The CMNIST graph also exhibits the same phenomenon. The environmental features are frequently perturbed, while the causal subgraph that determines label "2" remains invariant and stable during training. These visualizations further demonstrate that AdvCA can simultaneously guarantee environmental diversity and causal invariance.

4.4. ABLATION STUDY (RQ3)

Adversarial augmentation vs. causal learning. These are the two vital components that achieve environmental diversity and causal invariance. The results are depicted in Figure 4 (Left). "w/o Adv" and "w/o Cau" refer to AdvCA without adversarial augmentation and without causal learning, respectively. RDCA stands for a variant that replaces the adversarial augmentation in AdvCA with random augmentation (i.e., random masks). Compared to AdvCA, utilizing either causal learning or adversarial augmentation alone degrades performance. On the one hand, removing adversarial perturbations loses the invariance condition in causal learning, leading to suboptimal causal features. On the other hand, using adversarial augmentation alone destroys the causal features, thereby impairing generalization. RDCA exceeds ERM but is worse than AdvCA, suggesting that randomness also encourages diversity, even if it is less effective than the adversarial strategy.
Sensitivity Analysis. The causal ratio λ_c and penalty coefficient γ determine the extent of the causal features and the strength of the adversarial augmentation, respectively. We study their sensitivities in Figure 4 (Middle) and (Right), where dashed lines denote the performance of ERM. λ_c in the range 0.3∼0.8 performs well on Motif and Molbbbp, while Molhiv prefers 0.1∼0.3. This indicates that the causal ratio is a dataset-sensitive hyper-parameter that needs careful tuning. For the penalty coefficient, appropriate values on the three datasets range from 0.1 to 1.5.

5. RELATED WORK

Invariant Causal Learning (Lu et al., 2021) exploits causal features for better generalization. IRM (Arjovsky et al., 2019) minimizes the empirical risks within different environments. Chang et al. (2020) minimize the performance gap between environment-aware and environment-agnostic predictors to discover rationales. Motivated by these efforts, DIR (Wu et al., 2022b) constructs multiple interventional environments for invariant learning. GREA (Liu et al., 2022) and CAL (Sui et al., 2022) learn causal features by creating different environments. However, these methods only focus on correlation shift; their limited environments hinder success under covariate shift.
Graph Data Augmentation (Ding et al., 2022; Zhao et al., 2022; Yoo et al., 2022) enlarges the training distribution by perturbing features in graphs. Recent studies (Ding et al., 2021; Wiles et al., 2022) observe that it often outperforms other generalization efforts (Arjovsky et al., 2019; Sagawa et al., 2020). DropEdge (Rong et al., 2020) randomly removes edges, while FLAG (Kong et al., 2022) augments node features with an adversarial strategy. M-Mixup (Wang et al., 2021) interpolates graphs in semantic space. However, studies (Arjovsky et al., 2019; Lu et al., 2021) point out that causal features are the key to OOD generalization, and these augmentation efforts are prone to perturb the causal features, easily losing control of the perturbed distributions. Due to space constraints, we put more discussions of OOD generalization in Appendix F.

6. CONCLUSION & LIMITATIONS

In this work, we focus on the graph generalization problem under covariate shift, which is of great need but largely unexplored. We propose a novel graph augmentation strategy, AdvCA, which is built on the principles of environmental diversity and causal invariance. Environmental diversity allows the model to explore more novel environments, thereby generalizing better to possibly unseen test distributions. Causal invariance closes the distribution gap between the augmented and test data, resulting in better generalization. We make comprehensive comparisons with 14 baselines and conduct in-depth analyses and visualizations. The experimental results demonstrate that AdvCA achieves excellent generalization ability under covariate shift. We provide further discussion of the limitations of AdvCA and future work in Appendix G.

A IMPLEMENTATION DETAILS

A.1 ALGORITHM

We summarize the detailed implementation of AdvCA in Algorithm 1. Inspired by Suresh et al. (2021), we alternately optimize the adversarial augmenter and the causal generator together with the backbone model (lines 13 and 14). We adopt the causal features for predictions in the inference stage.

Algorithm 1 (steps 4-15):
4: for each (g_i, y_i) ∈ B_tr do
5:   M^a_adv, M^x_adv ← T_θ1(g_i)  // adversarial perturbations
6:   M^a_cau, M^x_cau ← T_θ2(g_i)  // regions of causal features
7:   M̃_a ← (1 − M^a_cau) ⊙ M^a_adv + M^a_cau  // augment edges
8:   M̃_x ← (1 − M^x_cau) ⊙ M^x_adv + M^x_cau  // augment nodes
9:   g̃_i ← (A_i ⊙ M̃_a, X_i ⊙ M̃_x)  // augmented graph
10: end for
11: Compute L_adv − L_reg1 via Equation 10 and Equation 12
12: Compute L_cau + L_reg2 via Equation 11 and Equation 13
13: Update the parameters of the adversarial augmenter via gradient ascent: θ_1 ← θ_1 + α∇_θ1(L_adv − L_reg1)
14: Update the parameters of the GNN and causal generator via gradient descent: θ ← θ − β∇_θ(L_cau + L_reg2); θ_2 ← θ_2 − β∇_θ2(L_cau + L_reg2)
15: end while
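A minimal NumPy sketch of the mask-mixing steps (lines 7-9), assuming a dense adjacency matrix and masks in [0, 1]; all tensors below are random stand-ins for the outputs of T_θ1 and T_θ2:

```python
import numpy as np

rng = np.random.default_rng(0)

n = 5                                                 # toy graph with 5 nodes
A = (rng.random((n, n)) < 0.5).astype(float)          # adjacency matrix A_i
X = rng.random((n, 3))                                # node feature matrix X_i

# Stand-ins for the augmenter / generator outputs (soft adversarial mask,
# binary causal-region mask for simplicity).
M_adv_a = rng.random((n, n))                          # adversarial edge mask
M_cau_a = (rng.random((n, n)) > 0.5).astype(float)    # causal edge region

# Line 7: perturb only non-causal edges, keep causal edges intact.
M_tilde_a = (1 - M_cau_a) * M_adv_a + M_cau_a

# Entries inside the causal region are forced to 1 (preserved);
# the rest take the adversarial values.
assert np.all(M_tilde_a[M_cau_a == 1] == 1)

A_aug = A * M_tilde_a                                 # line 9: augmented adjacency
assert A_aug.shape == (n, n)
```

The same mixing applies to the node-feature masks in line 8; the key design choice is that the adversarial mask can never overwrite the causal region.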

A.2 DATASETS AND METRICS

Datasets. In this paper, we conduct experiments on graph OOD datasets (Gui et al., 2022) and OGB datasets (Hu et al., 2020), which include Motif, CMNIST, Molbbbp and Molhiv. We follow Gui et al. (2022) to create various covariate shifts according to base, color, size and scaffold splitting. Base, color, size and scaffold are features of the graph data that do not determine the labels, so they can be regarded as environmental features. The statistics of the datasets are summarized in Table 3. Below we give a brief introduction to each dataset.
• Motif: A synthetic dataset from Spurious-Motif (Wu et al., 2022b; Sui et al., 2022). As shown in the original graphs in Figure 5, each graph is composed of a base-graph (wheel, tree, ladder, star, path) and a motif (house, cycle, crane). The label is determined only by the type of motif. We create covariate shifts according to the base-graph type and the graph size (i.e., node number). For the base covariate shift, we adopt graphs with wheel, tree and ladder base-graphs for training, star for validation and path for testing. For the size covariate shift, we use small graphs for training, while the validation and test sets include the middle- and large-size graphs, respectively.
• CMNIST: The Color MNIST dataset contains graphs transformed from MNIST via superpixel techniques (Monti et al., 2017). We define color as the environmental feature to create the covariate shift. Specifically, we color digits with 7 different colors, where five of them are adopted for training and the remaining two are used for validation and testing.
• Molbbbp & Molhiv: Molecular datasets collected from MoleculeNet (Wu et al., 2018). We define the scaffold and graph size (i.e., node number) as the environmental features to create two types of covariate shifts. For the scaffold shift, we follow Gui et al. (2022) and use scaffold splitting to create the training, validation and test sets.
For the size shift, we adopt large graphs for training and smaller ones for validation and testing.

Metrics. We adopt classification accuracy as the metric for Motif and CMNIST. As suggested by Hu et al. (2020), we use ROC-AUC for the Molhiv and Molbbbp datasets. In addition, we use GCS(P, Q) to measure the covariate shift between distributions P and Q. For all experimental results, we perform 10 random runs and report the means and standard deviations. The hyper-parameter values of AdvCA across the seven dataset/shift settings (Table 4) are: learning rate α: 1e-3, 1e-3, 1e-3, 1e-3, 5e-3, 1e-3, 1e-2; learning rate β: 5e-3, 5e-3, 5e-3, 1e-3, 5e-3, 1e-2, 1e-2; causal ratio λ_2: 0.5, 0.5, 0.5, 0.5, 0.5, 0.1, 0.1; adversarial penalty γ: 0.2, 0.2, 0.2, 0.5, 0.5, 0.5, 0.5.

A.3 TRAINING SETTINGS

We use an NVIDIA GeForce RTX 3090 (24GB GPU) to conduct all our experiments. For a fair comparison, we adopt GIN (Xu et al., 2019) as the default architecture in all experiments. We tune the hyper-parameters in the following ranges: α and β ∈ {0.01, 0.005, 0.001}; λ_2 ∈ {0.1, ..., 0.9}; γ ∈ {0.01, 0.1, 0.2, 0.5, 1.0, 1.5, 2.0, 3.0, 5.0}; batch size ∈ {32, 128, 256, 512}; hidden layers ∈ {32, 64, 128, 300}. The hyper-parameters of AdvCA are summarized in Table 4.

A.4 BASELINE SETTINGS

For a more comprehensive comparison, we select 14 baselines. In this section, we give a detailed introduction to the settings of these methods.
• For ERM, IRM (Arjovsky et al., 2019), GroupDRO (Sagawa et al., 2020), VREx (Krueger et al., 2021) and M-Mixup (Wang et al., 2021), we report the results from Gui et al. (2022) by default and reproduce the missing results on Molbbbp.
• DIR (Wu et al., 2022b), CAL (Sui et al., 2022), GSAT (Miao et al., 2022), DropEdge (Rong et al., 2020), GREA (Liu et al., 2022), FLAG (Kong et al., 2022) and G-Mixup (Han et al., 2022) provide source code for their implementations. We adopt the default settings from their source code and the detailed hyper-parameters from their original papers.
• The source code of OOD-GNN (Li et al., 2022a) and StableGNN (Fan et al., 2021) is not publicly available. We reproduce them based on the code of StableNet (Zhang et al., 2021).
• RDCA in Section 4.4 is a variant that replaces the adversarial augmentation in AdvCA with random augmentation. In our implementation, we use all-one matrices to create the initial node and edge masks, then randomly set 20% of the nonzero elements in these masks to zero. Finally, we apply these masks to the graphs for random data augmentation. The causal learning process is consistent with AdvCA.
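The RDCA masking step described above can be sketched in a few lines of NumPy; the helper name and dense-mask assumption are ours:

```python
import numpy as np

def random_mask(shape, drop_ratio=0.2, seed=0):
    """All-ones mask with `drop_ratio` of its nonzero entries set to zero,
    mirroring the RDCA random-augmentation variant."""
    rng = np.random.default_rng(seed)
    mask = np.ones(shape)
    idx = np.flatnonzero(mask)                     # all entries are nonzero here
    drop = rng.choice(idx, size=int(drop_ratio * idx.size), replace=False)
    mask.flat[drop] = 0.0
    return mask

m = random_mask((10, 10))
assert m.shape == (10, 10)
assert np.isclose(1.0 - m.mean(), 0.2)             # exactly 20% of entries dropped
```

Applying such a mask elementwise to the adjacency and node-feature matrices yields the randomly augmented graph, with the causal-learning branch left unchanged.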

B ESTIMATION OF GRAPH COVARIATE SHIFT

In this section, we elaborate on the implementation details of estimating the graph covariate shift. Without loss of generality, we take the example of estimating the covariate shift between the training and test distributions. Given a training set D_tr and a test set D_te, following probability distributions P_tr and P_te, the process of estimating GCS(P_tr, P_te) consists of two steps:
• Firstly, it is intractable to directly estimate the distribution in the graph space G. Inspired by Ye et al. (2022), we instead obtain graph features and estimate the distribution in the feature space F. Specifically, given a sample, we train a binary GNN classifier f to distinguish which distribution it comes from, where f(⋅) = Φ ○ h, h(⋅): G → F is a graph encoder, and Φ(⋅): F → {0, 1} is a binary classifier. We then adopt the pre-trained GNN encoder h to extract graph features.
• Secondly, we prepare the features and estimate the distribution of the data via Kernel Density Estimation (KDE) (Parzen, 1962). Finally, we adopt Monte Carlo integration under importance sampling (Binder et al., 1993) to approximate the integrals in Definition 1.
We summarize these implementations in Algorithm 2. In lines 4 and 5, to avoid label shift (Ye et al., 2022), we adopt sample reweighting to ensure the balance of each class. In lines 7 and 8, for each (g_i, y_i) ∈ B, we compute the loss ℓ(f(g_i), y_i) and back-propagate the gradients.
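Since Definition 1 is not reproduced in this chunk, the sketch below illustrates only the mechanics of the second step: a 1-D Gaussian KDE plus Monte Carlo integration under an importance-sampling proposal, applied to a total-variation-style distance between two synthetic feature sets. The feature dimensionality, bandwidth, and integrand are placeholder choices, not the paper's exact GCS:

```python
import numpy as np

def kde(samples, h=0.3):
    """1-D Gaussian kernel density estimator (Parzen, 1962)."""
    def p(x):
        z = (x[:, None] - samples[None, :]) / h
        return np.exp(-0.5 * z**2).mean(axis=1) / (h * np.sqrt(2 * np.pi))
    return p

rng = np.random.default_rng(0)
f_tr = rng.normal(0.0, 1.0, 2000)   # stand-in for training features h(g)
f_te = rng.normal(2.0, 1.0, 2000)   # stand-in for shifted test features

p_tr, p_te = kde(f_tr), kde(f_te)

# Monte Carlo integration under importance sampling: draw M points from the
# mixture proposal 0.5*(p_tr + p_te), approximated by pooling both sample sets.
M = 4000
pool = np.concatenate([f_tr, f_te])
x = rng.choice(pool, size=M)
proposal = 0.5 * (p_tr(x) + p_te(x))

# Illustrative divergence: total variation 0.5 * integral |p_tr - p_te|, in [0, 1].
tv = 0.5 * np.mean(np.abs(p_tr(x) - p_te(x)) / proposal)
assert 0.0 <= tv <= 1.0
assert tv > 0.3                      # the two feature distributions clearly differ
```

Each importance-weighted term is bounded by 1, so the estimate always lies in [0, 1]; for two well-separated Gaussians it comes out large, signaling a strong covariate shift.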

C CORRELATION SHIFT & COVARIATE SHIFT

From Assumption 1, we can observe that environmental features easily change outside the training distribution, owing to their noncausal nature. Hence, distribution shifts are caused only by the environmental features. Specifically, we define the joint distributions of training and test data as P_tr(G, Y) and P_te(G, Y), respectively. Since these joint distributions can be factorized as P_tr(G, Y) = P_tr(Y|G)P_tr(G) and P_te(G, Y) = P_te(Y|G)P_te(G), there are two main reasons for the distribution shift P_tr(G, Y) ≠ P_te(G, Y): a shift in the conditional distribution P(Y|G) (correlation shift) and a shift in the marginal distribution P(G) (covariate shift).

D MORE EXPERIMENTAL RESULTS

D.1 RESULTS UNDER CORRELATION SHIFT

Although this work focuses on the OOD issue of covariate shift, for completeness, we also evaluate the performance of AdvCA under correlation shift. Following Gui et al. (2022), we choose three graph OOD datasets (i.e., Motif, CMNIST, Molhiv) with three different graph features (i.e., base, color, size) to create correlation shifts. For baselines, we choose three generalization algorithms (i.e., ERM, IRM (Arjovsky et al., 2019), VREx (Krueger et al., 2021)), three graph generalization methods (i.e., DIR (Wu et al., 2022b), CAL (Sui et al., 2022), OOD-GNN (Li et al., 2022a)) and three data augmentation methods (i.e., DropEdge (Rong et al., 2020), FLAG (Kong et al., 2022), M-Mixup (Wang et al., 2021)). The experimental results are shown in Table 5. We observe that AdvCA also effectively alleviates the correlation shift. These results demonstrate that AdvCA learns better causal features by encouraging environmental diversity, which effectively breaks the spurious correlations hidden in the training data.

D.2 RESULTS ON COMMONLY USED DATASETS

To demonstrate the effectiveness of the proposed AdvCA, we also conduct experiments on the commonly used TU datasets (Morris et al., 2020), including MUTAG, NCI1, PROTEINS, COLLAB, IMDB-B and IMDB-M. These are real-world datasets with negligible distribution shift. For training settings, we follow CAL (Sui et al., 2022) and adopt GIN (Xu et al., 2019) as our backbone model. The experimental results are shown in Table 6. From the results, we can observe that our method achieves the best performance across the different datasets.

E COMPLEXITY ANALYSIS

Model size. In addition to the GNN backbone model, we introduce two small networks for adversarial augmentation and causal learning. In our implementation, the parameters of AdvCA are around twice those of the original GNN model.

F MORE RELATED WORKS

OOD Generalization (Shen et al., 2021) has been widely explored. Recent studies (Ye et al., 2022; Gui et al., 2022; Wiles et al., 2022) point out that OOD issues fall into two categories: correlation shift and covariate shift. Correlation shift denotes that the environmental features and labels establish statistical correlations that are inconsistent between training and test data. Models thus tend to learn spurious correlations and rely on shortcut features (Geirhos et al., 2020) for predictions, resulting in a large performance drop. In contrast, covariate shift indicates that unseen environmental features exist in test data; the limited training environments make this issue intractable. In recent years, OOD generalization on graphs has been drawing widespread attention (Li et al., 2022a;b; Fan et al., 2021; Wu et al., 2022a;b; Miao et al., 2022; Yu et al., 2022; Sui et al., 2022; Liu et al., 2022; Chen et al., 2022). However, these efforts mainly focus on correlation shift, while the issue of graph covariate shift is of great need but remains largely unexplored.

Comprehensive Comparisons with EERM (Wu et al., 2022a). Although EERM shares a similar goal with ours, i.e., generating several environments through augmentation, the two methods differ substantially in technique and contribution. Firstly, EERM ignores the distinction between correlation shift and covariate shift, so it is not specifically designed for covariate shift. In contrast, we distinguish these two shifts in detail and design a novel framework specifically for covariate shift. Secondly, EERM does not model causal and environmental features, and thus cannot explicitly distinguish them. In contrast, we explicitly model both, so we can effectively identify causal and environmental features and separate them from the data.
Thirdly, we design a metric, GCS(P, Q), which effectively measures the diversity of the environmental features in our augmented data, and we directly encourage the environmental diversity of the augmented samples by maximizing the GCS between the augmented and training distributions. EERM, by contrast, provides no evaluation metric for environmental diversity; to encourage diversity, it "blindly" maximizes the variance of the empirical risk across K environments. Finally, regarding generalization scope, EERM builds on IRM (Arjovsky et al., 2019) by minimizing the empirical risk in K environments, whereas, inspired by DRO (Sagawa et al., 2020), we can guarantee generalization within the robust radius ρ. We summarize these detailed discussions in Table 7.
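For contrast, the EERM-style diversity signal mentioned above can be written in one line; the per-environment risk values here are invented for illustration:

```python
import numpy as np

# EERM-style diversity proxy: the variance of empirical risks across
# K training environments (the quantity its environment generator maximizes).
risks = np.array([0.31, 0.55, 0.42])   # hypothetical risks in K = 3 environments
diversity = np.var(risks)
assert diversity > 0.0                  # nonzero spread across environments
```

Note that this proxy never inspects the environmental features themselves, which is why it cannot certify that the generated environments are actually diverse in feature space.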

G LIMITATION & FUTURE WORK

Although AdvCA outperforms numerous baselines and achieves outstanding performance under various covariate shifts, we prudently introspect the following limitations of our method. We leave addressing these limitations to future work.



foot_0: We provide detailed discussions of these two distribution shifts in Appendix C.
foot_1: We provide a formal definition in Assumption 1.



Figure 1: P train and P test denote the training and test distributions. P drop and P ours represent the distributions of augmented data via DropEdge and AdvCA. AdvCA establishes a smaller covariate shift (↔) with test distribution than DropEdge (↔).

Figure 3: Visualizations of the augmented graphs via AdvCA.

Figure 4: (Left): Performance comparisons of different components in AdvCA. (Middle): Performance over different causal ratios λ c . (Right): Performance over different penalties γ.

Algorithm 1: Adversarial Causal Augmentation
Require: Training set D_tr; adversarial augmenter T_θ1(⋅); causal generator T_θ2(⋅); GNN classifier f(⋅) with parameters θ; learning rates α, β; batch size N; causal ratio λ_c; penalty γ.
1: Randomly initialize θ, θ_1, θ_2
2: while not converged do
3:   Sample a batch B_tr ← {(g_i, y_i)}_{i=1}^N ⊂ D_tr
4:   for each (g_i, y_i) ∈ B_tr do
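The alternating gradient ascent/descent of lines 13 and 14 can be illustrated on a toy min-max objective with hand-computed gradients; the objective, step sizes, and iteration count are purely illustrative, not the paper's losses:

```python
# Toy min-max objective L(theta, theta1) = theta * theta1 - 0.1 * theta1**2,
# standing in for the adversarial/causal losses: the augmenter theta1 ascends
# on L, the model theta descends, in alternation (as in lines 13-14).
theta, theta1 = 1.0, 1.0
alpha, beta = 0.05, 0.05                  # learning rates, as in Algorithm 1

for _ in range(2000):
    # line 13: gradient ASCENT on the augmenter's parameters
    grad_theta1 = theta - 0.2 * theta1    # dL/dtheta1
    theta1 += alpha * grad_theta1
    # line 14: gradient DESCENT on the model's parameters
    grad_theta = theta1                   # dL/dtheta
    theta -= beta * grad_theta

# The damped alternating dynamics spiral into the saddle point (0, 0).
assert abs(theta) < 1e-2 and abs(theta1) < 1e-2
```

The small quadratic damping term plays the role of the regularizers L_reg1/L_reg2: without it, pure ascent/descent on a bilinear objective would orbit the saddle point instead of converging.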

Algorithm 2: Estimation of Graph Covariate Shift
Require: Training dataset D_tr and test dataset D_te; batch size N; loss function ℓ; GNN f = Φ ○ h; importance sampling size M; threshold ϵ.
Ensure: Estimated covariate shift GCS(P_tr, P_te).
1: Initialize parameters of f
2: # Train a graph classifier
3: while not converged do
4:   Sample a batch B_tr ← {(g_i, y_i)}_{i=1}^N ⊂ D_tr and relabel all y_i ← 0
5:   Sample a batch B_te ← {(g_i, y_i)}_{i=1}^N ⊂ D_te and relabel all y_i ← 1
6:   B ← B_tr ∪ B_te
7:   for each (g_i, y_i) ∈ B do

Figure 5: Visualizations of the augmented graphs via AdvCA.

D.3 MORE VISUALIZATIONS

We display more visualizations of the augmented graphs via AdvCA in Figure 5. To demonstrate the superiority of our method, we also visualize the causal features captured by AdvCA and compare them with other baselines. The results are displayed in Figure 6. From the results, we can observe that our method finds causal parts more accurately than the baseline methods.

• Method: We design a graph augmentation method, AdvCA, which focuses on covariate shift issues. It maintains the stability of causal features while ensuring environmental diversity.
• Experiment: We conduct extensive experiments on synthetic and real-world datasets. The experimental results, with in-depth analyses, demonstrate the effectiveness of AdvCA.

Performance on synthetic and real-world datasets. Numbers in bold indicate the best performance, while the underlined numbers indicate the second best performance.

Covariate shift comparisons with different augmentation strategies.


Table 3: Statistics of graph classification datasets.

Table 4: Hyper-parameter details of AdvCA.

10: Update the parameters of f via gradient descent and reset the gradients
11: end while
12: # Prepare the features for the estimation
13: Extract training and test feature sets F_tr and F_te via the encoder h
14: F ← F_tr ∪ F_te
15: Scale F to zero mean and unit variance
16: ω ← the distribution of F fitted by KDE
17: Split F to recover the original partition F′_tr, F′_te
⋮
GCS(P_tr, P_te) ← GCS(P_tr, P_te)/2M

• Correlation shift: P_tr(Y|G) ≠ P_te(Y|G). If the statistical correlation between environmental features and labels is inconsistent across training and test data, a model well fitted to the training data may fail on the test data. This is also known as spurious correlation or correlation shift (Ye et al., 2022). Formally, correlation shift describes the conditional distribution shift P_tr(Y|G) ≠ P_te(Y|G).
• Covariate shift: P_tr(G) ≠ P_te(G). If the test distribution contains environmental features that the model has not seen during training, performance will also drop. This unseen-distribution shift is known as covariate shift (Gui et al., 2022): the environmental features in the test data are unseen in the training data, which leads to P_tr(G) ≠ P_te(G). Hence, in Assumption 1, we quantitatively measure the covariate shift between P_tr(G) and P_te(G).
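The distinction can be made concrete with a toy example in which each graph is reduced to an (environment, label) pair; all counts and environment names are invented for illustration:

```python
# Training data: environments {ladder, tree}, with labels correlated to them.
train = ([("ladder", 1)] * 80 + [("ladder", 0)] * 20
         + [("tree", 0)] * 80 + [("tree", 1)] * 20)

def p_y_given_env(data, env):
    ys = [y for e, y in data if e == env]
    return sum(ys) / len(ys)

# Correlation shift: same environments, but P(Y=1 | ladder) flips at test time.
test_corr = [("ladder", 1)] * 20 + [("ladder", 0)] * 80
assert p_y_given_env(train, "ladder") != p_y_given_env(test_corr, "ladder")

# Covariate shift: the test environment "wheel" never occurs in training,
# i.e. P_te(G) puts mass where P_tr(G) has none.
test_cov = [("wheel", 1)] * 50 + [("wheel", 0)] * 50
train_envs = {e for e, _ in train}
assert all(e not in train_envs for e, _ in test_cov)
```

In the first case the conditional P(Y|G) changes while the environments themselves are familiar; in the second the conditional may be unchanged, but the marginal P(G) has support the model never saw.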

Table 5: Performance comparisons on synthetic and real-world datasets with correlation shift.

Table 6: Performance comparisons on TU datasets.

ETHICS STATEMENT

This paper does not involve any human subjects and does not raise any ethical concerns. We propose a graph data augmentation method to address the OOD issue of covariate shift. We conduct experiments on public datasets and validate the effectiveness on graph classification tasks. Our proposed method can be applied to practical applications, such as the prediction of molecular properties.

REPRODUCIBILITY STATEMENT

To help researchers reproduce our results, we provide detailed instructions here. For the implementation of AdvCA, we provide the detailed algorithm in Appendix A.1. For datasets, we use publicly available datasets; a detailed introduction is provided in Appendix A.2. For training details, we provide the training settings of AdvCA and the baseline settings in Appendix A.3 and A.4, respectively. Furthermore, we provide an anonymous code link: https://anonymous.4open.science/r/AdvCA-68BF

• AdvCA performs OOD exploration through an adversarial data augmentation strategy to achieve environmental diversity. However, it only perturbs the existing graph data in a given training set, e.g., the original node features or graph structures. Hence, some overlap may remain between the augmented distribution and the training distribution, so Principle 1 cannot be thoroughly achieved. In future work, we will explore more advanced augmentation methods, such as graph generation-based strategies (Zhu et al., 2022), to generate more novel, unseen graph data in pursuit of Principle 1.
• For model training, we adopt adversarial training and causal learning to alternately optimize the adversarial augmenter, the causal generator and the backbone GNN. This training strategy may make the training process unstable, so the performance of AdvCA may exhibit a large variance. In addition, the two auxiliary networks introduce extra parameters, and optimizing them separately increases the time complexity, as shown in Appendix E. Hence, in future work, we will explore more advanced optimization methods and lightweight models to achieve the principles of environmental diversity and causal invariance.

