WHICH INVARIANCE SHOULD WE TRANSFER? A CAUSAL MINIMAX LEARNING APPROACH

Abstract

A major barrier to deploying current machine learning models lies in their sensitivity to dataset shifts. To resolve this problem, most existing studies attempt to transfer stable information to unseen environments. Among these, graph-based methods causally decompose the data generating process into stable and mutable mechanisms. By removing the effect of mutable generation, they identify a set of stable predictors. However, a key question regarding robustness remains: which subset of the whole stable information should the model transfer, in order to achieve optimal generalization ability? To answer this question, we provide a comprehensive minimax analysis that fully characterizes the conditions for a subset to be optimal. In particular, for general cases, we propose to maximize over the mutable mechanisms (i.e., the source of dataset shifts), which provably identifies the worst-case risk over all environments. This enables us to select the optimal subset, namely the one with the minimal worst-case risk. To reduce computational cost, we propose to search only over equivalence classes in terms of worst-case risk, instead of over all subsets. In cases where the search space is still large, we turn the subset selection problem into a sparse min-max optimization scheme, which enjoys simplicity and efficiency of implementation. The utility of our methods is demonstrated on the diagnosis of Alzheimer's Disease and gene function prediction.

1. INTRODUCTION

Current machine learning systems, which are commonly deployed based on their in-distribution performance, often encounter dataset shifts Subbaswamy et al. (2019), such as covariate shift, label shift, etc., due to changes in the data generating process. When such a shift exists in the deployment environment, the model may give unreliable prediction results, which can cause severe consequences in safety-critical tasks such as healthcare (Hendrycks et al., 2021). At the heart of this unreliability issue are the stability and robustness aspects, which refer, respectively, to the insensitivity of prediction behavior and of generalization errors to shifts. For example, consider a system deployed to predict the Functional Activities Questionnaire (FAQ) score, which is commonly adopted Mayo (2016) to measure the severity of Alzheimer's Disease (AD). During prediction, the system can only access biomarkers or volumes of brain regions with anonymized demographic information for privacy considerations. However, changes in demographics can cause shifts in the covariates. For the deployed model to be reliable, it is desirable for its prediction to be stable against demographic changes, and meanwhile to be consistently accurate over all different populations. To incorporate both aspects, this paper aims to find the most robust (i.e., min-max optimal Müller et al. (2020)) predictor among the set of stable predictors over all distributions. To achieve this goal, many studies have proposed to learn invariance to transfer to unseen data. Examples include ICP Peters et al. (2016) and (Arjovsky et al., 2019; Liu et al., 2021; Ahuja et al., 2021), which assumed the prediction mechanism given causal features or representations to be invariant, and Anchor Regression Rothenhäusler et al. (2021), which explicitly attributed the variation to exogenous variables. In particular, Subbaswamy & Saria (2020); Subbaswamy et al.
(2019) causally decomposed the joint distribution into mutable M and stable S sets, with changed and unchanged causal mechanisms, respectively. They then proposed to intervene on M to obtain a set of stable predictors. Still, a question regarding robustness remains: which subset of stable information should the model transfer, in order to be most robust against dataset shifts?

Figure 1: FAQ prediction in Alzheimer's Disease. (a) Maximal mean squared error (MSE) over test environments; (b) Maximal MSE of predictors ranked in ascending order from left to right, according to the estimated worst-case risk of our method and the validation loss of the graph surgery estimator Subbaswamy et al. (2019), respectively. As shown, our method is more reflective of the maximal MSE than the graph surgery method.

The answer given by Subbaswamy et al. (2019) was to simply search over all subsets of S and take the one with the minimal validation loss. However, there is no theoretical or practical guarantee that the validation loss reflects the worst-case risk, as shown in Fig. 1 (b). To give a comprehensive answer, we first provide a graphical condition that is sufficient for the whole stable set to be optimal. This condition can be easily tested via causal discovery. When this condition fails, we prove that the worst-case risk can be identified by maximizing over the generating mechanism of M, i.e., the only source of shift. This conclusion enables us to select the optimal subset in a more accurate way. Considering again the example of FAQ prediction in AD diagnosis, Fig. 1 (b) shows that our method is more reflective of the maximal mean squared error (MSE) than Subbaswamy et al. (2019), which explains our advantage in predicting FAQ across patient groups shown in Fig. 1 (a). Besides, to reduce the searching cost, we propose to search only over equivalence classes in terms of worst-case risk. We find, however, that in some cases such a search can still be expensive.
To improve efficiency in these cases, we turn this subset selection task into a sparse min-max optimization scheme, which alternates between a gradient ascent step on the generating function of M and a sparse optimization step with a Lasso-type penalty to detect the optimal subset. We demonstrate the utility of our methods on a synthetic dataset and two real-world applications: Alzheimer's Disease diagnosis and gene function prediction. Contributions. We summarize our contributions as follows: 1. We propose to identify the optimal subset of invariance to transfer, guided by a comprehensive min-max analysis. To the best of our knowledge, we are the first to comprehensively study which part among all sources of invariance the model should transfer, in the literature of robust learning. 2. We introduce the concept of an "equivalence relation" in terms of worst-case risk in order to analyze the computational complexity, and propose a sparse min-max optimization method as a surrogate scheme to improve efficiency. 3. Our method significantly outperforms others in terms of subset selection and generalization robustness, on Alzheimer's Disease diagnosis and gene function prediction.

2. PRELIMINARIES AND BACKGROUND

Problem Setup & Notations. We consider the supervised regression scenario, where the system includes a target variable Y ∈ Y, a multivariate predictive variable X := [X_1, ..., X_d] ∈ R^d, and data collected from heterogeneous environments. In practice, different "environments" can refer to different groups of subjects or different experimental settings. We use {D^e | e ∈ E_Tr} to denote our training data, with D^e := {(x^e_k, y^e_k)}^{n_e}_{k=1} ∼ i.i.d. p^e(x, y) being the data from environment e with sample size n_e. The total number of training samples is n := Σ_e n_e. We say a predictor f : R^d → Y is stable if it can be learned from the training environments E_Tr and transferred to a broader family of environments E without any adjustment. We denote the stable predictor set as F_S and the distribution set as P := {P^e(X, Y)}_{e∈E}, with P^e(X, Y) the distribution over R^d × Y in environment e. For a causal directed acyclic graph (DAG) G, we denote the parents, children, and descendants of the node set V as Pa(V), Ch(V), and De(V), respectively. For a subset V′ ⊂ V, G_{V′} denotes the sub-graph obtained by deleting edges pointing to any member of V′. We denote conditional independence and d-separation by ⊥ and ⊥_G, respectively. Our goal is to find the most robust predictor f* among the stable predictor set F_S using data from E_Tr. A commonly used way Peters et al. (2016); Ahuja et al. (2021) to measure this robustness is to investigate the predictor's worst-case risk, which provides a safeguard for deployment in unseen environments. That is, we want f* to have the following min-max property:

f*(x) = argmin_{f ∈ F_S} max_{P^e ∈ P} E_{P^e}[(Y − f(X))^2].  (1)

Next, we introduce the causal model, Markovian, and faithfulness assumptions that our methods are based on. These assumptions are commonly made in the literature of causal inference and learning Pearl (2009); Spirtes et al. (2000); Arjovsky et al. (2019). Assumption 2.1 (Causal Model).
We assume that P^e(X, Y) is entailed by an unknown DAG G := (V, E) for all e ∈ E, where V := X ∪ Y denotes the node set and E denotes the edge set. Each variable V_i ∈ V is associated with a structural equation V_i ← g^e_i(Pa(V_i), U_i), where U_i denotes the exogenous variable. Each edge in E represents a direct causal relationship (Pearl, 1995). Assumption 2.2 (Markovian and Faithfulness). Markovianity means that the {U_i} are mutually independent. Together with faithfulness, it implies that for all disjoint sets V_i, V_j, V_k: V_i ⊥ V_j | V_k ⇐⇒ V_i ⊥_G V_j | V_k. Graph Surgery Estimator. Under assumptions 2.1 and 2.2, the graph surgery estimator Subbaswamy et al. (2019) causally decomposes the joint distribution p^e(x, y) into disentangled generating factors:

p^e(x, y) = p(y|pa(y)) ∏_{i∈S} p(x_i|pa(x_i)) ∏_{i∈M} p^e(x_i|pa(x_i)), with d_S := |S|, d_M := |M|,

where S and M respectively denote the stable and mutable sets, such that X_S := {X_i | ∀e ∈ E, p^e(x_i|pa(x_i)) ≡ p(x_i|pa(x_i))} contains the variables with stable mechanisms and X_M := {X_i | ∃ e_1 ≠ e_2 ∈ E, p^{e_1}(x_i|pa(x_i)) ≠ p^{e_2}(x_i|pa(x_i))} contains those with unstable mechanisms. They then removed the unstable mechanisms by intervening on X_M, which yields a set of stable predictors F_S := {f_{S^-} := E_P[Y | x_{S^-}, do(x_M)] | S^- ⊆ S} that are independent of e. In this regard, identifying f* in Eq. 1 is equivalent to selecting the optimal subset S* ⊆ S such that f* = f_{S*}. To identify S*, they showed that the whole set S is optimal under the degeneration condition: p(y|x_S, do(x_M)) = p(y|x′) for some X′ ⊆ X. In general cases, they searched over all S^- ⊆ S and selected the one with the minimal held-out validation loss.
However, this analysis is theoretically incomplete and practically defective: i) it does not provide a procedure to test the degeneration condition, making it inapplicable; ii) the selected subset may not be min-max optimal, as the validation loss does not necessarily reflect the worst-case risk (Fig. 1 (b)); iii) the search cost is exponential in d_S, making it hard to apply in large-scale scenarios. In the next section, we will provide a comprehensive min-max analysis to remedy these issues.
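To make the min-max criterion of Eq. 1 concrete, the following sketch estimates the empirical worst-case risk of candidate predictors over a set of environments and selects the one with the smallest maximal MSE. All names and the toy data-generating process are illustrative assumptions, not part of the paper's method.

```python
import numpy as np

def worst_case_mse(predictor, envs):
    """Empirical worst-case risk: the maximal MSE over a set of environments."""
    return max(np.mean((y - predictor(x)) ** 2) for x, y in envs)

def select_min_max(predictors, envs):
    """Pick the predictor whose worst-case MSE over `envs` is smallest."""
    risks = [worst_case_mse(f, envs) for f in predictors]
    return predictors[int(np.argmin(risks))], min(risks)

# Toy usage: two environments where the second covariate's mechanism shifts.
rng = np.random.default_rng(0)
envs = []
for shift in (0.0, -2.0):
    x = rng.normal(size=(200, 2))
    y = 1.5 * x[:, 0] + shift * x[:, 1] + 0.1 * rng.normal(size=200)
    envs.append((x, y))

stable = lambda x: 1.5 * x[:, 0]                   # uses only the stable mechanism
mutable = lambda x: 1.5 * x[:, 0] + 1.0 * x[:, 1]  # relies on the shifting one
best, risk = select_min_max([stable, mutable], envs)
```

Here the predictor relying on the shifting covariate looks better in the first environment but incurs a much larger risk in the second, so the min-max criterion prefers the stable one.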

3. METHODOLOGY

In this section, we introduce our method to identify S*. Specifically, in Sec. 3.1, we first present a comprehensive min-max analysis to identify S*, followed by the learning method in Sec. 3.2. We then analyze the computational complexity in Sec. 3.3 through the lens of g-equivalence, and show that the search cost can still be exponentially expensive in some cases. To improve efficiency in these cases, in Sec. 3.4 we introduce a sparse min-max optimization algorithm, which turns the subset selection problem into a sparse optimization scheme that enjoys model selection consistency.

3.1. IDENTIFICATION WITH MIN-MAX ANALYSIS

In this section, we introduce our method to identify the min-max optimal subset S* with theoretical guarantees. Our analysis is composed of two main results: Thm. 3.1 and Thm. 3.3. First, Thm. 3.1 provides a testable graphical condition that is sufficient for S* = S. When this condition fails, we show via a counter-example that the whole stable set S is not necessarily optimal. We then provide in Thm. 3.3 a sufficient and necessary condition for a subset to be optimal in general cases. In the following, we first introduce the graphical condition for S* = S, which is equivalent to the degeneration condition in Subbaswamy et al. (2019). Theorem 3.1 (Graphical Condition for f* = f_S). Suppose assumptions 2.1 and 2.2 hold. Denote X^0_M := X_M ∩ Ch(Y) as the mutable variables among Y's children, and K := De(X^0_M) \ X^0_M as the descendants of X^0_M. Then, p(y|x_S, do(x_M)) can degenerate to a conditional distribution if and only if Y does not point to any member of K. Further, under either of these two equivalent conditions, we have S* = S. Example 1. To illustrate, consider the example shown in Fig. 2, where the graphical condition holds when the dashed arrow from Y to K does not exist. To see its equivalence to the degeneration condition, we can take K as the stable variables. When Y ̸→ K, the path from Y to K can be blocked by X^0_M. We then have Y ⊥_{G_{X^0_M}} K | X^0_M and thus p(y|k, do(x^0_M)) = p(y) according to the inference rules in Pearl (2009). When the dashed arrow Y → K does exist, K becomes a collider on the path between Y and X^0_M, making it impossible to remove the "do" in p(y|k, do(x^0_M)). Compared to the degeneration condition, our graphical condition is more intuitive and can be easily tested via causal discovery, as guaranteed by the following proposition. Proposition 3.2. Under assumptions 2.1 and 2.2, we have that i) K is identifiable, and ii) Y → K is testable from the joint distribution over training environments.
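The graphical test in Thm. 3.1 and Prop. 3.2 is mechanical once a DAG is available: compute X^0_M = X_M ∩ Ch(Y), take K as the strict descendants of X^0_M, and check whether Y points into K. A minimal sketch, using networkx purely for illustration; the node names mirror Fig. 2 and are hypothetical:

```python
import networkx as nx

def condition_holds(G, Y, X_M):
    """Test the graphical condition of Thm. 3.1 on a causal DAG.

    X0_M: mutable variables among Y's children; K: their strict descendants.
    The condition S* = S holds when Y points to no member of K.
    """
    X0_M = set(G.successors(Y)) & set(X_M)
    if not X0_M:
        return True  # no mutable child of Y, so K is empty
    K = set().union(*(nx.descendants(G, v) for v in X0_M)) - X0_M
    return not any(G.has_edge(Y, k) for k in K)

# Toy graph mirroring Fig. 2: Y -> X0_M -> K, with X_M = {"X0_M"}.
G = nx.DiGraph([("Y", "X0_M"), ("X0_M", "K")])
ok_before = condition_holds(G, "Y", {"X0_M"})   # no edge Y -> K: holds
G.add_edge("Y", "K")                            # add the "dashed arrow"
ok_after = condition_holds(G, "Y", {"X0_M"})    # condition now fails
```

In practice the input graph would come from causal discovery (a PDAG), but the membership and edge checks are the same.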


Thm. 3.1 also reminds us that the sufficient condition for S* = S may not always hold. Indeed, we provide a counter-example in Sec. B.3 of the appendix showing that the whole stable set can have a larger worst-case risk than some of its subsets. To identify S* when the graphical condition fails, we turn to estimating the expected worst-case risk of each subset S^- ⊆ S from {D^e}_{e∈E_Tr}. Noticing that the variation of the unstable mechanisms in X_M is the only source of shifts in P^e(X, Y), we propose to parameterize these mechanisms and let them vary arbitrarily, in order to explore the behavior of the worst-case environment. Specifically, we consider a distribution family {P_J}_J for any J : Pa(X_M) → X_M, with P_J := P(Y, X_S | do(X_M = J(Pa(X_M)))). By maximizing the population risk over J for each subset, we obtain the worst-case risk of that subset, as shown in Thm. 3.3. Theorem 3.3 (Min-max Property). Denote h*(S^-) := max_J E_{P_J}[(Y − f_{S^-}(X))^2] as the maximal expected risk over {P_J}_J for subset S^-. Then, we have h*(S^-) = max_{P^e ∈ P} E_{P^e}[(Y − f_{S^-}(X))^2]. In this regard, the optimal subset S* can be attained via S* := argmin_{S^- ⊆ S} h*(S^-). This theorem tells us to optimize E_{P_J}[(Y − f_{S^-}(X))^2] over J to obtain h*(S^-), as it equals the worst-case risk of using subset S^-. With this theorem, it suffices to compare the h* of each subset to identify the optimal one. The following proposition ensures that the optimization is tractable, as each component used in the optimization, i.e., Pa(X_M), P_J, and f_{S^-}, is identifiable. Proposition 3.4. Under assumptions 2.1 and 2.2, Pa(X_M), P_J, and f_{S^-} are identifiable.
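Thm. 3.3 suggests a direct estimation recipe: parameterize the mutable mechanism as J_θ, regenerate X_M under do(X_M = J_θ(Pa(X_M))), and push the risk of a fixed f_{S^-} upward over θ. The sketch below does this for a one-variable toy model; the linear mechanism, the grid search (the paper uses gradient ascent on θ), and all names are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def risk_under_J(theta, f, n=5000):
    """Monte-Carlo risk E_{P_J}[(Y - f(x))^2] when the mutable variable is
    regenerated as X_M = J_theta(Pa(X_M)) = theta * Y (a toy linear mechanism)."""
    y = rng.normal(size=n)                       # stable generation of Y
    x_m = theta * y + 0.1 * rng.normal(size=n)   # intervened mutable mechanism
    return np.mean((y - f(x_m)) ** 2)

# A predictor that (unwisely) relies on the mutable variable, fit when theta = 1.
f_mutable = lambda x_m: x_m

# h*(S-) approximated by maximizing the risk over a grid of mechanisms J_theta.
thetas = np.linspace(-2.0, 2.0, 41)
h_star = max(risk_under_J(t, f_mutable) for t in thetas)
# The risk is smallest near theta = 1 and grows as the mechanism drifts away,
# so the worst case is attained at the boundary of the search range.
```

The estimated h* is far larger than the risk at the training mechanism (θ = 1), which is exactly the gap between validation loss and worst-case risk discussed in Sec. 2.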

3.2. LEARNING METHOD

According to the last section, we have S* = S if Y ̸→ K. Otherwise, we can simply search over all subsets of S and compare their h* to identify the optimal one, as similarly adopted in Subbaswamy et al. (2019). However, this exhaustive search can be redundant, as some subsets are equivalent in terms of prediction. Formally, we introduce g-equivalence, i.e., ∼_G, as follows: Definition 3.5 (g-equivalence). For two subsets S_i, S_j, we say S_i ∼_G S_j if ∃ S_ij ⊆ S_i ∩ S_j such that Y ⊥_{G_{X_M}} (X_{S_i} ∪ X_{S_j}) \ X_{S_ij} | X_{S_ij}, X_M. We call the elements of the quotient space Pow(S)/∼_G the g-equivalence classes, and denote N_G := |Pow(S)/∼_G| as the number of equivalence classes. Under assumption 2.2, it is easy to see that if S_i ∼_G S_j, then P(Y | X_{S_i}, do(X_M)) = P(Y | X_{S_j}, do(X_M)) and thus S_i, S_j have the same efficacy of robustness. In this regard, it suffices to search over Pow(S)/∼_G to identify the optimal subset, rather than exhaustively searching Pow(S). To enable this search, we provide a recovery algorithm that provably recovers all g-equivalence classes. For coherence and to save space, we leave this algorithm and its analysis to Sec. C.1 in the appendix. Equipped with Pow(S)/∼_G, we are now ready to introduce our algorithm to identify S*. Algorithm 1: Identification of the min-max optimal subset S* and predictor f*.

INPUT: The training data {D^e | e ∈ E_Tr}.
1: Causal discovery to obtain the partially directed acyclic graph (PDAG).
2: Detect K and whether Y → K.
3: if Y ̸→ K then
4:   Set S* = S and estimate f* = f_S. ◁ according to Thm. 3.1
5: else
6:   Recover Pow(S)/∼_G. ◁ with Alg. 6 in Sec. C.1
7:   Set h_min = ∞, S* = ∅.
8:   for S_G ∈ Pow(S)/∼_G do
9:     Randomly pick an S^- ∈ S_G, estimate f_{S^-} and h*(S^-).
10:    if h*(S^-) < h_min then
11:      Set h_min = h*(S^-), S* = S^-, f* = f_{S^-}.
12:    end if
13:   end for
14: end if
15: return S* and f*.

As the causal graph is unknown, Alg. 1 involves i) causal discovery to detect K and whether Y → K; ii) estimation of f_{S^-}; and iii) estimation of h*(S^-). In the following, we briefly introduce the main ideas of our method and leave the details to Sec. C in the appendix.

Causal discovery to detect K and examine whether Y → K. We first detect a partially directed acyclic graph (PDAG) via the PC algorithm (Spirtes et al., 2000), followed by our method to detect X_M with the assistance of the domain index variable E. Specifically, we have X_i ∈ X_M iff E → X_i in the detected PDAG, according to Huang et al. (2020). In a similar way, we can identify Pa(X_i), Ch(X_i) for i ∈ M, which is sufficient to detect X^0_M := X_M ∩ Ch(Y). Applying this method iteratively, we can detect De(X_M) and Pa(X_i) for X_i ∈ De(X_M), which is sufficient to identify K := De(X^0_M) \ X^0_M; we then have Y → K iff ∃ X_i ∈ K such that Y ∈ Pa(X_i).

Estimate f_{S^-}. We adopt soft intervention Eberhardt & Scheines (2007). To generate data from P, we first permute X_M in a sample-wise manner to generate data from P(X_M). Then, we recursively regenerate data for each variable in De(X_M) from its parents in G_{X_M}, by estimating the structural equations. This is tractable since De(X_M) and the parent nodes of each variable in De(X_M) are identifiable, as mentioned earlier.

Estimate h*(S^-). We first learn h(S^-, J) := E_{P_J}[(Y − f_{S^-}(x))^2]. As f_{S^-} can be estimated, we only need to obtain data from P_J. As Pa(X_M) is identifiable, we iteratively regenerate data for X_M from J(Pa(X_M)) and also for X_i ∈ Pa(X_M) from its parents, in order to obtain samples from P_J. Then we maximize h(S^-, J) over J to obtain h*(S^-).
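The search branch of Alg. 1 is a plain argmin over class representatives: one member per g-equivalence class suffices, since members share the same worst-case risk. A minimal sketch, with caller-supplied estimators standing in for the f_{S^-} and h*(S^-) estimation procedures described above; all names and the toy risk values are hypothetical.

```python
def identify_optimal_subset(classes, estimate_f, estimate_h_star):
    """Search over g-equivalence classes: pick one representative per class,
    estimate its worst-case risk h*, and keep the minimizer."""
    h_min, s_star, f_star = float("inf"), None, None
    for cls in classes:
        s_minus = cls[0]                     # any representative of the class
        f = estimate_f(s_minus)
        h = estimate_h_star(s_minus, f)
        if h < h_min:
            h_min, s_star, f_star = h, s_minus, f
    return s_star, f_star, h_min

# Toy usage with hypothetical worst-case risks for each representative subset.
classes = [[("X1",)], [("X2",), ("X2", "X3")], [("X1", "X2")]]
risks = {("X1",): 0.9, ("X2",): 0.3, ("X1", "X2"): 0.5}
s_star, _, h_min = identify_optimal_subset(
    classes, estimate_f=lambda s: s, estimate_h_star=lambda s, f: risks[s])
```

The returned subset is the representative with the minimal estimated worst-case risk, matching lines 8-13 of Alg. 1.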

3.3. COMPLEXITY ANALYSIS

In this section, we discuss the complexity of Alg. 1. Benefiting from the testable condition in Thm. 3.1, our algorithm enjoys a constant cost when Y does not point to any member of K. This situation happens when the target variable of interest represents the effect of the predictive covariates, e.g., the number of bike rides is determined by temperature, weather, etc. When Y does point to K, e.g., Y represents a disease and K represents its symptoms or biomarkers, Alg. 1 needs to search among the g-equivalence classes and the complexity is O(N_G). We give some examples to show that N_G can be polynomial in d_S in some cases but can also grow exponentially with d_S, depending on the number of edges in the graph. Before this, we first show (Lemma D.1 in the appendix) that N_G does not decrease (increase) if we add (delete) edges to (from) G. Claim 3.6. For the chain, we have N_G = O(d_S); for the skip-chain, we have N_G = O(d_S^{2k}), where k is the number of added edges; for the knot graph, we have N_G = O(c^{d_S}) for some 1 < c < 2. We first consider the chain graph, i.e., Y → X_{S,(1)} → ... → X_{S,(d_S)} (where (1), ..., (d_S) is a permutation of 1, ..., d_S according to the generating order in the chain) in Fig. 3 (a), for which N_G is polynomial as shown in Claim 3.6. This is simply because blocking X_{S,(i)} d-separates Y from X_{S,(j)} for any j > i, making {X_{S,(i)}} g-equivalent to any {X_{S,(i)}, X_{S,(j_1)}, ..., X_{S,(j_k)}} with i < j_1 ≤ ... ≤ j_k for any k. Next, we consider two cases of adding k edges to the chain: i) the skip-chain graph (Fig. 3 (b)), where k does not increase with d_S; in this case, N_G is still polynomial since the number of paths between Y and any X_{S,(i)} can be bounded; ii) the knot graph (Fig. 3 (c)), where k increases with d_S; in this case, N_G increases exponentially with d_S, because the number of paths between Y and X_{S,(i)} can be exponentially large. The proof of Claim 3.6 and more examples are left to Sec. C.1 in the appendix.
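The chain case of Claim 3.6 can be checked by brute force: on Y → X_1 → ... → X_d, conditioning on the earliest variable of a subset d-separates Y from all later ones, so each subset's induced conditional is determined by its minimal element. A small illustrative count, assuming that reasoning as the class key:

```python
from itertools import combinations

def n_classes_chain(d):
    """Count g-equivalence classes for the chain Y -> X_1 -> ... -> X_d.
    A subset's conditional P(Y | X_S-, do(X_M)) depends only on its earliest
    variable, so that variable (or None for the empty set) keys its class."""
    keys = set()
    for r in range(d + 1):
        for subset in combinations(range(1, d + 1), r):
            keys.add(min(subset) if subset else None)
    return len(keys)

# N_G grows linearly in d_S (here d_S + 1, counting the empty subset),
# versus 2**d_S subsets in total.
counts = [n_classes_chain(d) for d in (2, 3, 8)]
```

For d = 2, 3, 8 this yields 3, 4, and 9 classes, i.e., d_S + 1 rather than 2^{d_S}.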

3.4. SPARSE MIN-MAX OPTIMIZATION

According to the previous discussion, the overall search complexity can still be exponentially large. To improve efficiency, we provide an alternative method that turns the subset selection problem of Eq. 1 into the following sparse min-max optimization scheme:

min_{α,β} max_θ E_{p(x,y|x_M = J_θ(pa(x_M)))}[(y − f_α(x_S ⊙ β, x_M))^2] + λ‖β‖_1,

where we introduce the coefficient vector β and implement a lasso-type penalty on β with hyperparameter λ > 0. This penalty regularizes β to be sparse, and its support set, i.e., supp(β) := {i : β_i ≠ 0}, is used to select the optimal subset. To optimize, we alternately take a gradient ascent step with respect to θ, followed by minimization over (α, β). Note that under irrepresentable and restricted convexity conditions Zhao & Yu (2006), we have model selection consistency, i.e., the true support set of β can be recovered, as well as ℓ_2-consistency when d_S is fixed, according to Rejchel (2016). When d_S increases with n, under restricted convexity conditions, we show that this lasso-type estimator, which belongs to the broader family of M-estimators Negahban et al. (2012), is ℓ_2-consistent. To further reduce the complexity of the minimization step, we propose to implement Linearized Bregman Iteration (LBI) via differential inclusion, which enjoys the efficiency of generating a whole regularization solution path. In each iteration, the original minimization step is replaced by a gradient descent step followed by a soft-thresholding step. Details are left to Sec. E in the appendix.
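A minimal sketch of the alternating scheme on a linear toy problem. All mechanisms, step sizes, and the exact boundary inner max are illustrative assumptions (the paper's gradient ascent on θ and LBI-based minimization are replaced here by an exact inner max and a plain proximal soft-thresholding step):

```python
import numpy as np

rng = np.random.default_rng(0)

def soft_threshold(b, t):
    """Proximal step for the lasso penalty lambda * ||beta||_1."""
    return np.sign(b) * np.maximum(np.abs(b) - t, 0.0)

# Toy linear instance: y depends only on the first stable feature; the
# mutable feature is regenerated as x_m = theta * y, theta in [-T, T].
n, T, lam = 4000, 5.0, 0.05
lr_b, lr_g = 0.05, 0.001
x_s = rng.normal(size=(n, 2))                    # stable features X_S
y = 1.5 * x_s[:, 0] + 0.1 * rng.normal(size=n)

beta = np.zeros(2)    # coefficients on x_s; supp(beta) selects the subset
gamma = 0.5           # coefficient on the mutable feature x_m

for _ in range(300):
    # Inner max over theta: the risk is quadratic in theta, so the worst
    # case lies at a boundary of [-T, T]; evaluate both ends exactly.
    theta = max((-T, T),
                key=lambda t: np.mean((y - x_s @ beta - gamma * t * y) ** 2))
    # Outer min: gradient step on (beta, gamma), then soft-threshold beta.
    resid = y - x_s @ beta - gamma * theta * y
    beta = soft_threshold(beta + 2 * lr_b * (x_s.T @ resid) / n, lr_b * lam)
    gamma += 2 * lr_g * np.mean(resid * theta * y)

# gamma is driven toward 0 (any reliance on x_m is punished by the adversary)
# and beta concentrates on the stable, truly predictive feature.
```

The support of β ends up on the first stable feature only, mirroring how supp(β) selects the optimal subset in the scheme above.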

4. EXPERIMENT

In this section, we evaluate our method on synthetic data and two real-world applications: the diagnosis of Alzheimer's Disease, one of the most common types of dementia among elderly people, and gene function prediction, which can potentially help understand human disease progression Muñoz-Fuentes et al. (2018). Compared Baselines. We compare our methods with the following baselines: i) ICP (Peters et al., 2016; Bühlmann, 2020), which assumed invariance of P(Y|Pa(Y)); ii) IC (Rojas-Carulla et al., 2018), which extended the above assumption to features beyond Pa(Y); iii) Anchor regression (Rothenhäusler et al., 2021), which interpolated between ordinary least squares (LS) and causal minimax LS and constrained the residual in the anchor subspace to be small; iv) IRM (Arjovsky et al., 2019), which learned an invariant representation to transfer; v) HRM (Liu et al., 2021), which extended IRM to the case with unknown domain labels by exploring the heterogeneity in data via clustering; vi) IB-IRM (Ahuja et al., 2021), which leveraged information bottleneck regularization to supplement the invariance principle in IRM; and vii) the Graph Surgery estimator Subbaswamy et al. (2019), which used the validation loss to identify the optimal subset. Implementation Details. We leave the implementation details to Sec. F in the appendix. Data Generation. We follow Fig. 4 to generate X_S := {X_1, X_2, X_3} and X_M := {X_4} via the following structural equations: x_3 ← u_3 with u_3 ∼ N(−2, 1); x_2 ← g_2(x_3) + u_2 with u_2 ∼ N(0, 1); y ← g_y(x_2) + u_y with u_y ∼ N(0, 1); x_4 ← β_e g_4(y) + u_4 with β_e = e − 5 varying across domains and u_4 ∼ N(0, 1); x_1 ← g_1(x_4, y) + u_1 with u_1 ∼ N(0, 0.2). We consider three settings: i) g_2(x_3) = 0.5x_3, g_y(x_2) = −1.5x_2, g_4(y) = y, and g_1(x_4, y) = x_4; ii) g_1(·) is changed to g_1(x_4, y) = x_4 + 2.5y; iii) g_2(x_3) = 10 sinc(x_3), g_y(x_2) = 2 tanh(x_2), g_4(y) = −0.25y^3 + y, and g_1(x_4, y) = Sigmoid(x_4 + y).
For each setting, we generate 10 environments with e = 1, ..., 10, where n_e = 200 for each environment. To remove the effect of randomness, we repeat 10 times, and each time we randomly select five domains for training and use the rest for testing.
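For concreteness, setting i) of the generating process above can be sketched as follows (assuming N(0, 0.2) denotes a variance of 0.2; function and variable names are ours):

```python
import numpy as np

def generate_env(e, n_e=200, rng=None):
    """One environment of simulation setting i): linear mechanisms, with
    x4's generating mechanism mutable through beta_e = e - 5."""
    if rng is None:
        rng = np.random.default_rng(e)
    x3 = rng.normal(-2.0, 1.0, n_e)                 # u3 ~ N(-2, 1)
    x2 = 0.5 * x3 + rng.normal(0.0, 1.0, n_e)       # g2(x3) = 0.5 * x3
    y = -1.5 * x2 + rng.normal(0.0, 1.0, n_e)       # g_y(x2) = -1.5 * x2
    x4 = (e - 5) * y + rng.normal(0.0, 1.0, n_e)    # mutable: beta_e = e - 5
    x1 = x4 + rng.normal(0.0, np.sqrt(0.2), n_e)    # g1(x4, y) = x4
    return np.stack([x1, x2, x3, x4], axis=1), y

# Ten environments e = 1, ..., 10, each with n_e = 200 samples.
envs = {e: generate_env(e) for e in range(1, 11)}
```

Note that only x_4's mechanism depends on e, so X_M = {X_4} is the sole source of shift, as in Fig. 4.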

4.1. SIMULATION STUDY

Causal Discovery and Complexity Analysis. We use the F_1 score, precision, and recall in terms of directed edges to assess our causal discovery algorithm. We repeat 10 times, and the average results are F_1 = 0.99, precision = 1.00, and recall = 0.98, which suggests the validity of our algorithm. Validations on larger graphs are left to the appendix. As for complexity, in setting 1, the condition in Thm. 3.1 holds and we expect {X_1, X_2, X_3} to be the optimal set. In settings 2 and 3, the condition is violated and we need to compare the h* of each equivalence class to find the optimal subset. There are seven equivalence classes, as the only equivalence relation is {X_2} ∼_G {X_2, X_3}. Results Analysis. In Tab. 1, we report the estimated h* and the maximal mean squared error (MSE) over test sets for each subset S^- and the vanilla regression method. As shown, in setting 1, the whole stable set enjoys the minimal max MSE, which agrees with Thm. 3.1; while in settings 2 and 3, the subset ({X_1} in setting 2, {X_2} in setting 3) identified by the minimal h* has the minimal max MSE. This suggests the effectiveness of our method in finding the optimal subset. Besides, we observe that h* estimates the maximal MSE well in most cases, e.g., h*({X_1, X_2}) = 0.06 vs. a max MSE of 0.07 in setting 2, and h*({X_3}) = 3.10 vs. a max MSE of 3.11 in setting 3. In addition, equivalent subsets have similar performance, e.g., the max MSEs of {X_2} and {X_2, X_3} in setting 3 are both 1.10, which verifies that the search can be conducted only over the equivalence classes. • ADNI. The dataset includes n = 757 patients enrolled in the ADNI-GO/1/2 periods. We apply the Automatic Anatomical Labeling atlas (Tzourio-Mazoyer et al., 2002) and the region index of Young et al. (2018) to partition the whole brain into 9 regions: frontal lobe (FL), medial temporal lobe (MTL), parietal lobe (PL), occipital lobe (OL), cingulum (CIN), insula (INS), amygdala (AMY), hippocampus (HP), and pallidum (PAL).
In addition to brain region volumes, we also include gender (GED) and genetic information (the number of ApoE-4 alleles (ApE)). With these covariates, we predict the Functional Activities Questionnaire (FAQ) score Y for each patient. We split the dataset into seven environments according to age (age <60, 60-65, 65-70, 70-75, 75-80, 80-85, >85), which respectively contain n_e = 27, 59, 90, 240, 182, 117, 42 samples. We repeat 15 times; each time, four domains are randomly selected for training and the rest for testing. • IMPC. The dataset contains the hematology phenotypes of both wild-type and mutant mice with 13 kinds of single-gene knockouts. To predict the function of the target gene, we knock it out and assess the cell counts of monocyte (MON), with the cell counts of neutrophil (NEU), lymphocyte (LYM), eosinophil (EO), basophil (BA), and large unstained cell (LUC) as covariates. Each environment corresponds to one kind of gene knockout. We repeat 45 times: each time, five randomly picked gene knockouts and the wild-type are selected as training sets, and the remaining eight kinds are left for testing. Causal Discovery. We implement our causal discovery algorithm in Sec. 3.2 to learn the PDAGs shown in Fig. 5 (a, b). For ADNI, Fig. 5 (a) shows that the effect of AD, measured by the FAQ score, first appears in the medial temporal lobe (MTL) and the hippocampus (HP), and then propagates to other brain regions, which echoes existing studies showing that the MTL and HP are early degenerated regions (Barnes et al., 2009; Duara et al., 2008). Besides, we observe that the frontal lobe (FL), pallidum (PAL), and hippocampus (HP) are mutable regions, which agrees with the heterogeneity across different age groups found in existing studies Cavedo et al. (2014); Fiford et al. (2018). For IMPC, Fig.
5 (b) shows that the monocyte (MON) affects the numbers of lymphocytes (LYM) and large unstained cells (LUC), which reflects the activation mechanisms of LYM (Carr et al., 1994) and LUC (Lee et al., 2021). It also plays a role in the activation of basophil (BA) through the causal chain MON → LYM → BA, as also found in an existing study (Goetzl et al., 1984). Complexity Analysis. On both ADNI and IMPC, the condition in Thm. 3.1 is violated, as Y (FAQ on ADNI, MON on IMPC) points to K (MTL on ADNI, LYM on IMPC). So, we need to search over the g-equivalence classes and compare their h*, as suggested by Thm. 3.3. The numbers of g-equivalence classes are 98 (out of the 2^8 = 256 subsets) on ADNI and 12 (out of the 2^4 = 16 subsets) on IMPC. Results Analysis. Fig. 1 (a) and Fig. 6 (a) report the maximal MSE of our method and the baselines. As we can see, our methods significantly outperform the others (by 7.8% on ADNI and 9.7% on IMPC). Besides, our sparse optimization is comparable to the search method in Alg. 1. These results demonstrate the utility of Thm. 3.3 in identifying the optimal subset, as well as the effectiveness of our sparse optimization. Further results show that our h* well reflects the worst-case risk, as it increases with the worst-case risk, whereas the validation loss of the surgery estimator has no such property. In particular, the top subsets ranked by our h* also have the minimal max MSE, while the subset selected by the surgery estimator fails to identify S*. The improvements over ICP, IC, IRM, and their extensions are due to our utilization of stable information beyond causal features/representations. The advantage over Anchor regression may be due to the relaxation of the linearity assumption. In addition, Fig. 7 shows that subsets in the same equivalence class have similar maximal MSE, which further validates searching only over the g-equivalence classes in Alg. 1.

5. CONCLUSION

In this paper, we propose a minimax learning approach to identify the optimal subset of invariance to transfer, in order to achieve robustness against dataset shift. Among all subsets of stable information, we provide a sufficient and necessary condition for a subset to be min-max optimal. We analyze the search complexity by introducing the notion of graphical equivalence and propose a sparse min-max optimization algorithm for cases where the search cost is expensive. The subset identified by our method outperforms others in terms of robustness on Alzheimer's Disease diagnosis and gene function prediction tasks. In the future, we are interested in studying scenarios where the DAG is also allowed to vary across domains, which may happen when the number of environments is large. To obtain the optimal subset, the graph surgery estimator Subbaswamy et al. (2019) simply searched over all subsets of S and took the one with the minimal validation loss. However, their method is theoretically incomplete and practically defective, as the selected subset may not be min-max optimal and the search cost is expensive in large-scale graphs. In contrast, we provide a comprehensive min-max analysis to guarantee the identification of the optimal subset. For practical deployment, we analyze the search complexity through the lens of g-equivalence. For graphs with expensive search costs, we provide a sparse min-max optimization scheme that can largely improve efficiency. Optimization-based domain generalization. There are emerging works that view the domain generalization problem as an optimization problem. These methods directly formulate the objective of out-of-distribution generalization and optimize for the min-max optimum. For example, Distributional Robust Optimization (DRO) Duchi & Namkoong (2021) constrained the distance between the test and training distributions with an f-divergence or the Wasserstein distance and optimized the min-max objective. One of its popular extensions, GroupDRO Sagawa et al.
(2019), provided extra regularization (e.g., weight decay or early stopping) and allowed DRO models to achieve better performance in large neural networks. However, these methods rely heavily on data-driven optimization and lack an analysis of the source of distributional shifts. For this reason, they have to constrain the distributional shifts to a limited extent so as to ease optimization. Such a limitation affects their ability to generalize to a broader distribution family, thus limiting their real-world applications. In contrast, we consider domain generalization from a causal perspective. Benefiting from the causal framework for distributional shifts, our method can identify the reasons behind distributional shifts and achieve the min-max optimum even when the distribution can vary arbitrarily.

Causal discovery in heterogeneous data.

Proof. Denote the causal DAG as $G$, and the intervened graph that removes all arrowheads into $V$ as $G_V$. Define $X^1_M := X_M \backslash X^0_M$ and $K_2 := (X \backslash X^0_M) \backslash De(X^0_M)$. We first prove the equivalence of the following conditions (1), (2), and (3):

1. $Y \perp_{G_{X^0_M}} K \mid K_2$;
2. $Y$ and $K$ are not adjacent in $G$;
3. $p(y|x_S, do(x_M))$ can degenerate to a conditional distribution.

(1)→(2) If $Y$ and $K$ are adjacent in $G$, they are also adjacent in $G_{X^0_M}$ because $K \cap X^0_M = \emptyset$, so $Y$ and $K$ cannot be d-separated by any variable set in $G_{X^0_M}$, which contradicts (1).

(2)→(3) Define
$$I := p(y|k, k_2, do(x^0_M)) = \frac{p(y|pa(y)) \prod_{X_j \in K} p(x_j|pa(x_j)) \prod_{X_i \in K_2} p(x_i|pa(x_i))}{\int p(y|pa(y)) \prod_{X_j \in K} p(x_j|pa(x_j)) \prod_{X_i \in K_2} p(x_i|pa(x_i))\, dy}.$$
Since $PA(Y) \cap \{X^0_M, K\} = \emptyset$ and $\forall X_i \in K_2$, $PA(X_i) \cap \{X^0_M, K\} = \emptyset$, we have
$$I = \frac{p(y, k_2) \prod_{X_j \in K} p(x_j|pa(x_j))}{\int p(y, k_2) \prod_{X_j \in K} p(x_j|pa(x_j))\, dy}.$$
If $Y$ and $K$ are not adjacent, then $\forall X_j \in K$, $Y \notin PA(X_j)$, so the product over $K$ does not depend on $y$ and cancels. Therefore, $I = \frac{p(y, k_2)}{\int p(y, k_2)\, dy} = p(y|k_2)$.

(3)→(1) We will prove by contradiction.
Specifically, we will show that if Y ̸ ⊥ G X 0 M K|K 2 , i.e., (1) does not hold, then p e (y|x S , do(x M )) can not degenerate to any conditional distribution, i.e., (3) does not hold. We firstly show Y ̸ ⊥ G X 0 M K|K 2 ⇒ p e (y|x S , do(x M )) ̸ = p e (y|k 2 , do(x 0 M )) , then show p e (y|x S , do(x M )) ̸ = p e (y|k 2 , do(x 0 M )) ⇒ p e (y|x S , do(x M )) can not degenerate to any conditional distribution. Since Y / ∈ PA(X 1 M ), we have: p e (y|x S , x 1 M , do(x 0 M )) = p e (y, x S , x 1 M |do(x 0 M )) p e (y, x S , x 1 M |do(x 0 M ))dy = p e (y|pa(y)) i∈S p e (x i |pa(x i )) Xi∈X 1 M p e (x i |pa(x i )) p e (y|pa(y)) i∈S p e (x i |pa(x i )) Xi∈X 1 M p e (x i |pa(x i ))dy = p e (y|pa(y)) i∈S p e (x i |pa(x i )) p e (y|pa(y)) i∈S p e (x i |pa(x i ))dy = p e (y|x S , do(x M )) Since K ∪ K 2 = X S ∪ X 1 M , we have p e (y|x S , do(x M )) = p e (y|k, k 2 , do(x 0 M )). Thus, we can prove: Y ̸ ⊥ G X 0 M K|K 2 ⇒ p e (y|x S , do(x M )) = p e (y|k, k 2 , do(x 0 M )) ̸ = p e (y|k 2 , do(x 0 M )). Next, we prove p e (y|x S , do(x M )) ̸ = p e (y|k 2 , do(x 0 M )) ⇒ p e (y|x S , do(x M )) can not degenerate to any conditional distribution. Suppose p e (y|x S , do(x M )) = p e (y|k ′ , k 2 , do(x 0 M )). We will show if k ′ ̸ = ∅, then the dooperator can not be removed with either Rule 2 (action to observation) or Rule 3 (deletion of action). To express do(x 0 M ) explicitly, denote X 0 M = {X 0 M,i } r i=1 and p e (y|k ′ , k 2 , do(x 0 M )) = p e (y|k ′ , k 2 , do(x 0 M,1 ), . . . , do(x 0 M,r )). • Rule 2 can not remove the do-operator of any X 0 M,i ∈ X 0 M . Recall Rule 2 states that "p(y|do(x), do(z), w) = p(y|do(x), z, w) if Y ⊥ G XZ Z|X, W for any disjoint subsets of variables X, Y, Z, and W ". If Rule 2 can remove the do-operator of X 0 M,i ∈ X 0 M , then Y ⊥ G X 0 M \ { X 0 M,i } X 0 M,i X 0 M,i |K ′ , K 2 , X 0 M \ X 0 M,i . As we have Z = X 0 M,i , X = X 0 M \X 0 M,i , W = K ′ ∪ K 2 in the notations of Rule 2. In the following, we explain why Eq. 
4 cannot be true. Note that $X^0_{M,i} \in Ch(Y)$ and the direct edge $Y \to X^0_{M,i}$ is preserved in the intervened graph $G_{\overline{X^0_M \backslash \{X^0_{M,i}\}}\,\underline{X^0_{M,i}}}$, which means that $Y$ and $X^0_{M,i}$ cannot be d-separated by any set of variables in the intervened graph. Thus, Eq. 4 cannot be true.

• Rule 3 cannot remove the do-operator of all $X^0_{M,i} \in X^0_M$. Recall that Rule 3 states "$p(y|do(x), do(z), w) = p(y|do(x), w)$ if $Y \perp_{G_{\overline{X}, \overline{Z(W)}}} Z \mid X, W$ for any disjoint subsets of variables $X, Y, Z$, and $W$", where $Z(W)$ denotes the set of $Z$-nodes that are not ancestors of any $W$-node in $G_{\overline{X}}$. If Rule 3 can remove the do-operator of $X^0_M$, then
$$Y \perp_{G_{\overline{X^0_M(K' \cup K_2)}}} X^0_M \mid K' \cup K_2 \quad (5)$$
because the notations in Rule 3 mean $X = \emptyset$, $Z = X^0_M$, $W = K' \cup K_2$. In the following, we show that when $K' \neq \emptyset$, Eq. 5 cannot hold. When $K' \neq \emptyset$, note that by definition $K' \subset De(X^0_M)$, so $An(K') \cap X^0_M \neq \emptyset$. Therefore, $X^0_M(K' \cup K_2) = X^0_M \backslash \{An(K') \cup K_2\} \neq X^0_M$; that is, $X^0_M \backslash X^0_M(K' \cup K_2) \neq \emptyset$. Suppose $X^0_{M,i} \in X^0_M \backslash X^0_M(K' \cup K_2)$; then the edge $Y \to X^0_{M,i}$ is in the intervened graph $G_{\overline{X^0_M(K' \cup K_2)}}$, so $Y$ and $X^0_{M,i}$ cannot be d-separated by any variable set. So Eq. 5 does not hold.

In summary, we have proved that when $K' \neq \emptyset$, the do-operator on $X^0_M$ cannot be removed entirely by Rules 2 and 3. Besides, according to Corollary 3.4.2 in Pearl (2009), the inference rules are complete in the sense that if an interventional probability (with do) can be reduced to a probability expression (without do), the reduction can be realized by a sequence of transformations, each conforming to one of Inference Rules 1-3. Note that only Rules 2 and 3 are related to the removal of a do-operator, so it suffices to prove that Rules 2 and 3 cannot remove the do-operator on $X^0_M$.

We then prove that under any of conditions (1), (2), or (3), $f^* = f_S$. Given any one of the three conditions, $f^*(x) = E_{P^e}[Y|x_S, do(x_M)]$ satisfies the following min-max property:
$$f^*(x) = \operatorname*{argmin}_{f: \mathcal{X} \to \mathcal{Y}}\ \max_{P \in \mathcal{P}} E_P[(Y - f(x))^2].$$
Under any one of the conditions (1)-(3), we have p e (y|x S , do(x M )) = p e (y|k 2 ) for P e ∈ P. For P e ∈ P, let p e x 0 M = Xi∈V \X 0 M p e (v) be the marginal distribution of X 0 M . Define P e as: p e (v) = p (y|pa(y)) Xi∈K p e (x i |pa(x i )) Xi∈K2 p e (x i |pa(x i )) p e x 0 M , by replacing the term Xi∈X 0 M p (x i |pa(x i )) in p e (v) with p e x 0 M . (i) By the definition of P, P e ∈ P and p e (y|x) = p e (y|x S , x 1 M , x 0 M ) = p e (y|x S , x 1 M , do(x 0 M )) = p e (y|k 2 ) (ii) In the following, we will show pe (y, k 2 ) = p e (y, k 2 ). First, note that K ⊂ De X 0 M and X 0 M ⊂ Ch(Y ), we have K ∪ X 0 M ⊂ De(Y ). Thus, PA(Y ) ∩ K ∪ X 0 M = ∅ because otherwise there would be a cycle. Second, PA(K 2 ) ∩ K ∪ X 0 M = ∅ because if there exist X i ∈ K ∪ X 0 M and also X i ∈ PA(K 2 ), then K 2 ∩ De X 0 M ̸ = ∅, which contradicts with the definition that K 2 := X\X 0 M \De X 0 M . In summary, we have PA(K 2 ∪ Y ) ∩ K ∪ X 0 M = ∅, which leads to p e (k 2 , y) = p(y|pa(y))Π Xi∈K2 p e (x i |pa(x i )) Π Xi∈K p e (x i |pa(x i )) Π Xi∈X 0 M p e (x i |pa(x i )) dx 0 M dk = p(y|pa(y))Π Xi∈K2 p e (x i |pa(x i )) Π Xi∈K p e (x i |pa(x i )) Π Xi∈X 0 M p e (x i |pa(x i )) dx 0 M dk = p(y|pa(y))Π Xi∈K2 p e (x i |pa(x i )) and pe (k 2 , y) = p(y|pa(y))Π Xi∈K2 p e (x i |pa(x i )) Π Xi∈K p e (x i |pa(x i )) p e x 0 M dx 0 M dk = p(y|pa(y))Π Xi∈K2 p e (x i |pa(x i )) Π Xi∈K p e (x i |pa(x i )) p e x 0 M dx 0 M dk = p(y|pa(y))Π Xi∈K2 p e (x i |pa(x i )) Therefore, we have p e (k 2 , y) = p e (k 2 , y). Note that K 2 ⊂ X, we have  Var P e (Y |K 2 ) = E P e [ [Y |x] = E P e Y |x S , x 1 M , do x 0 M = E P e [Y |x S , do(x M )], so f * (x) = E P * [Y |x] = E P * [Y |x S , do(x M )]. As E P e [Y |x S , do(x M )] is invariant for all P e ∈ P. Therefore, we have f * (x) = E P e [Y |x S , do(x M )]. B.2 PROOF FOR PROP. 3.2: TESTABILITY OF THM. 3.1 Proposition 3.2. 
Denote $X^0_M := X_M \cap Ch(Y)$ as the mutable variables among $Y$'s children, and $K := De(X^0_M) \backslash X^0_M$ as the descendants of $X^0_M$. Under Assumptions 2.1 and 2.2, $K$ is identifiable; besides, we can determine whether $Y \to K$ from the joint distribution over training domains.

Proof. We first show that, given $K$, the edges $Y \to K$ can be determined. Since all variables in $K$ are descendants of $Y$, we have $Y \to X_i$ for $X_i \in K$ iff $X_i$ is adjacent to $Y$ in the skeleton of the DAG (which is identifiable under Assumption 2.2). Thus, we can determine whether $Y \to K$.

Algorithm 2: Detection of $X_M$ and construction of the causal skeleton of $G$
1. Start with $X_M = \emptyset$. For $V_i \in V$, test whether $V_i \perp E$ or whether there exists a subset $C_{v_i,e} \subseteq V$ such that $V_i \perp E | C_{v_i,e}$. If $V_i \not\perp E$ and no such $C_{v_i,e}$ exists, then include $V_i$: $X_M = X_M \cup \{V_i\}$.
2. Start with an undirected graph $G$ including edges between any two variables in $V$ and the arrows $E \to V_i$ for $V_i \in X_M$. For each pair $\{V_i, V_j\}$: if $V_i \perp V_j$ or there exists a subset $C_{v_i,v_j} \subset V$ such that $V_i \perp V_j | C_{v_i,v_j}$, delete the edge $V_i - V_j$ from $G$.

It remains to show that $K$ is identifiable. Note that $K = (X \backslash X^0_M) \cap De(X^0_M)$, so it suffices to prove the identifiability of $X^0_M \cup De(X^0_M)$, where $X^0_M := X_M \cap Ch(Y)$. This can be accomplished in three steps: (i) identification of $X_M$, (ii) identification of $X^0_M$, and (iii) identification of $X^0_M \cup De(X^0_M)$. Algorithm 2 above shows step (i), which is the same as in Huang et al. (2020). The following Alg. 3 shows steps (ii) and (iii), which rely on the faithfulness assumption (conditional independence in probability ⇒ d-separation in the graph).
Algorithm 3: Detection of $X^0_M$ and $X^0_M \cup De(X^0_M)$

2.1 Detect $X^0_M := X_M \cap Ch(Y)$:
1: for $X_i \in X_M$ adjacent to $Y$ do
2:   if $Y \not\perp E | C_{y,e} \cup \{X_i\}$
3:   then $X_i \in X_M \cap Ch(Y)$
4: end for

2.2 Detect $\{Ch(Y) \cap X_M\} \cup De(Ch(Y) \cap X_M)$:
1: Start with $A = B = Ch(Y) \cap X_M$ and visited($X_i$) = FALSE
2: while $B \neq \emptyset$ do
3:   for $X_j \in B$ do
4:     for $X_i \in Adj(X_j)$ do
5:       if $X_i \notin X_M$ and $X_i \perp E | C_{e,x_i} \cup \{X_j\} \backslash D_{x_i,e}$ then
6:         $A = A \cup \{X_i\}$
7:         if visited($X_i$) = FALSE then
8:           $B = B \cup \{X_i\}$
9:         end if
10:      end if
11:      if $X_i \in X_M$ and $X_i \notin Adj(Y)$ and $X_i \perp Y | C_{x_i,y} \cup \{X_j\} \backslash D_{y,x_j}$ then
12:        $A = A \cup \{X_i\}$
13:        if visited($X_i$) = FALSE then
14:          $B = B \cup \{X_i\}$
15:        end if
16:      end if
17:    end for
18:    Let $B = B \backslash \{X_j\}$
19:  end for
20: end while

Explanations for 2.2:
• Line 1: The set $A$ is the final output. The set $B$ only plays a part as an instrumental set that starts as $X_M \cap Ch(Y)$ and ends as $\emptyset$. During the process, $B$ stores the nodes whose children have not yet been searched. Once $X_j \in B$ is searched, it is excluded from $B$ (Line 18) and the children of $X_j$ are added to $B$ if they have not been visited (Lines 8 and 14), which is essentially a breadth-first search.
• Lines 5 to 10 (the case $X_i \notin X_M$): The fact $X_i \notin X_M$ means $E$ and $X_i$ are not adjacent. Besides, since $X_j \in X_M \cap Ch(Y)$, there is a structure of the form $E \to \cdots \to X_j - X_i$ where $X_i$ and $E$ are not adjacent. In the notation $X_i \perp E | C_{x_i,e} \cup \{X_j\} \backslash D_{e,x_j}$, $C_{x_i,e}$ denotes a separating set such that $X_i \perp E | C_{x_i,e}$, and $D_{e,x_j}$ denotes the set of variables along the directed path $E \to \cdots \to X_j$. The existence of $C_{x_i,e}$ is guaranteed since $X_i$ and $E$ are not adjacent, so a separating set was already found when constructing the skeleton. The set $D_{e,x_j}$ is also available, as it is determined during the breadth-first search.
• Lines 11 to 19 (the case $X_i \in X_M$): First, we explain why it is unnecessary to consider the case $X_i \in X_M$ and $X_i \in Adj(Y)$. If $X_i \in PA(Y)$, then $X_i$ cannot be in $De(Ch(Y) \cap X_M) \cup \{Ch(Y) \cap X_M\}$, as this would induce a cycle. If $X_i \in Ch(Y)$, then $X_i \in Ch(Y) \cap X_M$ has already been identified in 2.1 and included in $A$ at the beginning. So the remaining case is $X_i \in X_M$ and $X_i \notin Adj(Y)$. Note that in this case $X_j \in Ch(Y)$ or $X_j \in De(Y)$, so there exists a structure $Y \to \cdots \to X_j - X_i$, which plays the same role as $E \to \cdots \to X_j - X_i$ in Lines 5 to 10.

B.3 COUNTER EXAMPLE OF $f^* \neq f_S$

Figure 8: DAG of the counterexample.

Consider the DAG in Fig. 8, in which $Y, X_s, X_m$ are binary variables. We will show that in this scenario, there exist $P(Y)$, $P(X_s|X_m, Y)$ such that $f_S := E[Y|x_s, do(x_m)]$ is not min-max optimal. We show this by proving that
$$E[(Y - E[Y|x_s, do(x_m)])^2] > E[(Y - E[Y|do(x_m)])^2]. \quad (6)$$
Since $E[(Y - E[Y|x_s, do(x_m)])^2] = E[Y^2] + E[E^2[Y|x_s, do(x_m)]] - 2E[Y \cdot E[Y|x_s, do(x_m)]]$, and $E[(Y - E[Y|do(x_m)])^2] = E[Y^2] - E[Y]^2$ because $p(y|do(x_m)) = p(y)$, Eq. (6) is equivalent to
$$E[E^2[Y|x_s, do(x_m)]] > 2E[Y \cdot E[Y|x_s, do(x_m)]] - E^2[Y]. \quad (7)$$
Denote $a_y := p(y=1)$, $a_{my} := p(x_m=1|y)$, and $a_{s x_m y} := p(x_s=1|x_m, y)$. The left-hand side of Eq. (7) is
$$E[E^2[Y|x_s, do(x_m)]] = \sum_{x_s, x_m} \Big( \sum_y p(x_s|x_m, y)\, p(x_m|y)\, p(y) \Big)\, E^2[Y|x_s, do(x_m)],$$
which, for $(x_s, x_m) \in \{(1,1), (1,0), (0,1), (0,0)\}$ respectively, expands to
$$\big(a_{s11}a_{m1}a_y + a_{s10}a_{m0}(1-a_y)\big) \Big(\tfrac{a_y a_{s11}}{a_y a_{s11} + (1-a_y)a_{s10}}\Big)^2 + \big(a_{s01}(1-a_{m1})a_y + a_{s00}(1-a_{m0})(1-a_y)\big) \Big(\tfrac{a_y a_{s01}}{a_y a_{s01} + (1-a_y)a_{s00}}\Big)^2$$
$$+ \big((1-a_{s11})a_{m1}a_y + (1-a_{s10})a_{m0}(1-a_y)\big) \Big(\tfrac{a_y(1-a_{s11})}{a_y(1-a_{s11}) + (1-a_y)(1-a_{s10})}\Big)^2 + \big((1-a_{s01})(1-a_{m1})a_y + (1-a_{s00})(1-a_{m0})(1-a_y)\big) \Big(\tfrac{a_y(1-a_{s01})}{a_y(1-a_{s01}) + (1-a_y)(1-a_{s00})}\Big)^2.$$
The right-hand side of Eq. (7) is
$$2E[Y \cdot E[Y|x_s, do(x_m)]] - E^2[Y] = 2\Big( \tfrac{a_y^2 a_{s11}^2 a_{m1}}{a_y a_{s11} + (1-a_y)a_{s10}} + \tfrac{a_y^2 a_{s01}^2 (1-a_{m1})}{a_y a_{s01} + (1-a_y)a_{s00}} + \tfrac{a_y^2 (1-a_{s11})^2 a_{m1}}{a_y(1-a_{s11}) + (1-a_y)(1-a_{s10})} + \tfrac{a_y^2 (1-a_{s01})^2 (1-a_{m1})}{a_y(1-a_{s01}) + (1-a_y)(1-a_{s00})} \Big) - a_y^2.$$
When $a_y \neq 0$, both sides can be divided by $a_y^2$. Letting $a_y \to 0$, the left-hand side (divided by $a_y^2$) approaches
$$\frac{a_{s11}^2 a_{m0}}{a_{s10}} + \frac{a_{s01}^2 (1-a_{m0})}{a_{s00}} + \frac{(1-a_{s11})^2 a_{m0}}{1-a_{s10}} + \frac{(1-a_{s01})^2 (1-a_{m0})}{1-a_{s00}},$$
and the right-hand side approaches
$$2 \Big( \frac{a_{s11}^2 a_{m1}}{a_{s10}} + \frac{a_{s01}^2 (1-a_{m1})}{a_{s00}} + \frac{(1-a_{s11})^2 a_{m1}}{1-a_{s10}} + \frac{(1-a_{s01})^2 (1-a_{m1})}{1-a_{s00}} \Big) - 1.$$
Then Eq. (7) is equivalent to
$$\frac{a_{s11}^2 (a_{m0} - 2a_{m1})}{a_{s10}} + \frac{a_{s01}^2 (2a_{m1} - a_{m0} - 1)}{a_{s00}} + \frac{(1-a_{s11})^2 (a_{m0} - 2a_{m1})}{1-a_{s10}} + \frac{(1-a_{s01})^2 (2a_{m1} - a_{m0} - 1)}{1-a_{s00}} > -1.$$
Letting $a_{s10} \to 0$, $a_{s11} \to 1$, $a_{s01} = a_{s00} = 0.5$, and choosing $a_{m0} - 2a_{m1} > 0$, the first term diverges to $+\infty$ while the remaining terms stay bounded, so the above inequality holds.
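As a sanity check on the counterexample, the two risks in Eq. (6) can be compared numerically by enumerating the binary joint distribution. The parameter values below are illustrative choices near the limiting conditions stated in the text, not values from the paper:

```python
# Numerical check of the counterexample: the stable predictor
# f_S(xs, xm) = E[Y | xs, do(xm)] can have a HIGHER observational risk
# than the constant predictor f_0 = E[Y | do(xm)] = E[Y].
# Assumed illustrative parameters: a_y -> 0, a_s11 -> 1, a_s10 -> 0,
# a_s01 = a_s00 = 0.5, and a_m0 > 2 * a_m1.
a_y = 0.01                                  # p(Y = 1)
a_m = {1: 0.1, 0: 0.9}                      # p(Xm = 1 | Y = y)
a_s = {(1, 1): 0.99, (1, 0): 0.01,          # p(Xs = 1 | Xm, Y)
       (0, 1): 0.5, (0, 0): 0.5}

def p_joint(y, xm, xs):
    """Observational joint p(y) p(xm|y) p(xs|xm, y)."""
    py = a_y if y else 1 - a_y
    pm = a_m[y] if xm else 1 - a_m[y]
    ps = a_s[(xm, y)] if xs else 1 - a_s[(xm, y)]
    return py * pm * ps

def f_S(xs, xm):
    """E[Y | xs, do(xm)]: the do-operator removes the factor p(xm|y)."""
    num = a_y * (a_s[(xm, 1)] if xs else 1 - a_s[(xm, 1)])
    den = num + (1 - a_y) * (a_s[(xm, 0)] if xs else 1 - a_s[(xm, 0)])
    return num / den

def risk(f):
    """E[(Y - f(Xs, Xm))^2] under the observational joint."""
    return sum(p_joint(y, xm, xs) * (y - f(xs, xm)) ** 2
               for y in (0, 1) for xm in (0, 1) for xs in (0, 1))

r_S = risk(f_S)
r_0 = risk(lambda xs, xm: a_y)   # f_0 = E[Y | do(xm)] = E[Y] = a_y
assert r_S > r_0                  # f_S is not min-max optimal here
```

With these values, the risk of $f_S$ strictly exceeds that of the constant predictor, matching the inequality in Eq. (6).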

B.4 PROOF FOR THM. 3.3: MIN-MAX PROPERTY

Theorem 3.3 (Min-max Property). Denote h * (S -) := max J E P J [(Y -f S-(x) ) 2 ] as the maximal expected loss over J for S -. Then, we have h * (S -) = max P e ∈P E P e [ Y -f S-(x) 2 ]. In this regard, the optimal subset S * for f * = f S * can be attained via S * := argmin S-⊂S h * (S -). Proof. We show the maximum loss is attained when X M is a definite function of PA(X M ) Let f S-(x S-, x M ) := E[Y |x S-, do(x M ) ] be an invariant predictor. Then L P e (f S-) = x y (y -f S-(x S-, x M )) 2 p(y|pa(y)) i∈S p(x i |pa(x i )) i∈M p e (x i |pa(x i ))dydx. And the maximum loss L * S-= argmax P e L P e (f S-) = argmax {p e (xi|pa(xi))|i∈M } L P e (f S-). Let X ′ := X\(X M ∪PA(X M )) and h(x M , pa(x M )) := x ′ (y -f S-(x)) 2 Xi∈X ′ p(x i |pa(x i ))d x ′ , which does not rely on the mutable distribution {p e (x i |pa(x i ))|i ∈ M }. Let m * (pa(x M )) := argmax x M h(x M , pa(x M ))). Firstly, consider the case of X M = {X M }. Then max P e LP e (fS -) = max P e pa(x M )   x M h(xM , pa(xM ))p e (xM |pa(xM ))dxM   X i ∈PA(X M ) p(xi|pa(xi))dpa(xM ) = pa(x M ) max P e   x M h(xM , pa(xM ))p e (xM |pa(xM ))dxM   X i ∈PA(X M ) p(xi|pa(xi))dpa(xM ) = pa(x M ) h(m * (pa(xM )), pa(xM )) X i ∈PA(X M ) p(xi|pa(xi))dpa(xM ). When X M is multivariate, we consider the maximization sequentially by the topological order {X M,1 , X M,2 , • • • , X M,l }, where X M,j is a node that is not a parent of any other nodes in {X M,i |i ≥ j} in the sub-graph over X M . That is, we firstly consider max p e (x M,1 |pa(x M,1 )) x M,1 h(x M,1 , pa x M,1 )p e (x M,1 |pa x M,1 )d xm,1 and factorize max P e {• • • } as max p e (x M,l |pa(x M,l )) • • • max p e (x M,2 |pa(x M,2 )) max p e (x M,1 |pa(x M,1 )) {• • • }. Note that the sub-graph on X M is always a DAG, so such a topological order always exists. Under assumptions 2.1 and 2.2, we have PA(X M ), P J , and f S-are identifiable. Proof. 
To generate data distributed as P J , we need to use J(PA(X M )) to regenerate X M , then regenerate De(X M ) with structural equations. To estimate f S-, we need to intervene on X M , then regenerate De(X M ) with structural equations. So, it's suffice to show X M , De(X M ), PA(X M ), PA(De(X M )) are identifiable. Identification of X M has been shown in Alg. 2. We first show the identification of De(X M ). • Line 5 to 10: this case is the same as Algorithm 3, which is based on (i) the structure E → • • • → X j -X i and (ii) X i and E are not adjacent. • Line 11 to 19: In this case, we identify the direction between X i and X j by the "Independent Causal Mechanism (ICM) Principle" following Huang et al. (2020) , where ∆ Xj →Xi and ∆ Xi→Xj are the estimated HSIC (see Eq. 17 in Huang et al. (2020) for the detailed formulation of ∆). The ICM principle means that "the conditional distribution of each variable given its causes (i.e., its mechanism) does not inform or influence the other mechanisms.". That is, the changes of P (X i |PA(X i )) does not influence the other mechanisms P (X j |PA(X j )) for j ̸ = i. The ICM principle is implied in the definition of "structural causal model" in Pearl (2009) , where each structural equation represents an autonomous physical mechanism. We then show the identification of PA(X M ), PA(De(X M )). Algorithm 5 PA(X i ) for X i ∈ X M ∪ De(X M ). 1: for Xj ∈ XM ∪ De(XM ) do 2: for Xi ∈ Adj(Xj) do 3: if Xi ̸ ∈ XM then 4: Xi ∈ PA(Xj) if Xi ̸ ⊥ E| Ce,x i \ De,X i ∪ {Xj} 5: else if Xj ∈ XM and Xi ∈ Xm then 6: Xi ∈ PA(Xj) when ∆X j →X i < ∆X i →X j . 7: else if Xj ∈ XM and Xi ̸ ∈ Xm then 8: Xi ∈ PA(Xj) when E ̸ ⊥ Xi|Cx i ,e ∪ {Xj} 9: end if 10: end 11: end for • Line 4: this rule is based on the structure E → • • • → X j -X i and {X i , E} are not adjacent. • Line 6: this rule is based on the HSIC criterion in Huang et al. (2020) . • Line 8:this rule is based on the structure E → X i -X j and {E, X j } are not adjacent.
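The maximization step in the proof of Thm. 3.3 — that the worst-case environment concentrates $p^e(x_M|pa(x_M))$ on the point mass at $m^*(pa(x_M)) = \operatorname{argmax}_{x_M} h(x_M, pa(x_M))$ — can be illustrated numerically on a discrete grid. The arrays `h` and `p_pa` below are arbitrary assumed inputs, not quantities from the paper:

```python
import numpy as np

# h[j, k]: the inner loss h(x_M = k, pa(x_M) = j), precomputed;
# p_pa[j]: the (stable) distribution of pa(x_M). Both are toy inputs.
rng = np.random.default_rng(0)
h = rng.random((4, 3))
p_pa = np.array([0.1, 0.2, 0.3, 0.4])

def env_risk(Q):
    """Expected loss when the mutable mechanism p^e(x_M | pa) is the
    row-stochastic matrix Q: sum_j p_pa[j] * sum_k Q[j, k] * h[j, k]."""
    return float(p_pa @ (h * Q).sum(axis=1))

# The deterministic mechanism x_M = argmax_k h[j, k] attains the maximum.
worst_case = float(p_pa @ h.max(axis=1))

# No randomized mutable mechanism can exceed it.
for _ in range(200):
    Q = rng.random((4, 3))
    Q /= Q.sum(axis=1, keepdims=True)
    assert env_risk(Q) <= worst_case + 1e-12
```

The design mirrors the proof: since the loss is linear in each row of $Q$, the maximum over distributions is attained at a vertex of the simplex, i.e., a definite function of $PA(X_M)$.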

C APPENDIX FOR SEC. 3.2: LEARNING METHOD C.1 EQUIVALENT CLASSES AND ITS RECOVERY ALGORITHM

When the graphical condition in Thm. 3.1 fails, Alg. 1 needs to search over subsets of the stable set to identify the optimal predictor. However, we find that an exhaustive search over all subsets is redundant, as some subsets are equivalent in the sense of predicting $Y$. Formally speaking,

Definition C.1 (p-equivalence). Two subsets $X_i$ and $X_j$ (the subscript $S$ is omitted for simplicity) of the stable set $X_S$ are probabilistically equivalent, i.e., $X_i \sim_P X_j$, if $P(Y|X_i, do(X_M)) = P(Y|X_j, do(X_M))$.

Remark C.2. It is straightforward to see that $\sim_P$ satisfies reflexivity ($X_i \sim_P X_i$), symmetry ($X_i \sim_P X_j \Rightarrow X_j \sim_P X_i$), and transitivity ($X_i \sim_P X_j, X_j \sim_P X_k \Rightarrow X_i \sim_P X_k$), and is thus a valid equivalence relation.

Under the Markovian assumption, we can infer p-equivalence from the structure of the causal graph, especially from patterns of d-separation. In the following, we first give the notion of graphical equivalence, then show how to infer p-equivalence from it.

Definition C.3 (g-equivalence). Two subsets of vertices $X_i$ and $X_j$ are graphically equivalent w.r.t. the causal graph $G$, i.e., $X_i \sim_G X_j$, if $\exists X_{ij} \subseteq X_i \cap X_j$ such that $Y \perp_{G_{X_M}} X_i \cup X_j \backslash X_{ij} \mid X_{ij}, X_M$.

It is straightforward to see that $\sim_G$ satisfies reflexivity ($X_i \sim_G X_i$) and symmetry ($X_i \sim_G X_j \Rightarrow X_j \sim_G X_i$). To prove that it also satisfies transitivity ($X_i \sim_G X_j, X_j \sim_G X_k \Rightarrow X_i \sim_G X_k$), we need to introduce two properties of d-separation.

Lemma C.4 (Properties of d-separation). (i) If a path $p$ cannot be blocked by a vertex set $X_i$, then no sub-path of $p$ can be blocked by $X_i$ either. (ii) For two vertex sets $X_i$, $X_j$, and a path $p$, if $p$ cannot be blocked by $X_i$ but can be blocked by $X_i \cup X_j$, then $X_j$ must contain a non-collider on $p$.

Proof. The correctness of property (i) is straightforward, so we focus on proving property (ii). Specifically, there are three possibilities for the path $p$:

1. All vertices on $p$ are non-colliders.
From "$p$ cannot be blocked by $X_i$", we know that no vertex on $p$ is in $X_i$. From "$p$ can be blocked by $X_i \cup X_j$", we know that at least one vertex on $p$ is in $X_j$.

2. All vertices on $p$ are colliders. From "$p$ cannot be blocked by $X_i$", we know that $\forall X \in p$, $X \in X_i$ or $X$ has a descendant in $X_i$. So $\forall X \in p$, $X \in X_i \cup X_j$ or $X$ has a descendant in $X_i \cup X_j$, which means $p$ cannot be blocked by $X_i \cup X_j$ either; this case therefore cannot occur.

3. The vertices on $p$ include both colliders and non-colliders. From "$p$ cannot be blocked by $X_i$", we know that $\forall X \in p$: if $X$ is a non-collider, $X \notin X_i$; if $X$ is a collider, $X$ or a vertex in $De(X)$ is in $X_i$, and thus in $X_i \cup X_j$. So $X_j$ must contain a non-collider on $p$; otherwise, $p$ could not be blocked by $X_i \cup X_j$.

Equipped with the above properties, we now show that $\sim_G$ also satisfies transitivity, i.e., $X_i \sim_G X_j, X_j \sim_G X_k \Rightarrow X_i \sim_G X_k$.

Proof. Because $X_i \sim_G X_j$ and $X_j \sim_G X_k$, by definition $\exists X_{ij} \subseteq X_i \cap X_j$ such that $Y \perp_G X_i \cup X_j \backslash X_{ij} \mid X_{ij}$, and $\exists X_{jk} \subseteq X_j \cap X_k$ such that $Y \perp_G X_j \cup X_k \backslash X_{jk} \mid X_{jk}$. The different situations of $X_{ij}$ and $X_{jk}$ are discussed below:

1. $X_{ij} = X_{jk} = X_0$. Then we have $Y \perp_G X_i \cup X_j \cup X_k \backslash X_0 \mid X_0$. So $Y \perp_G X_i \cup X_k \backslash X_0 \mid X_0$ with $X_0 \subseteq X_i \cap X_k$, i.e., $X_i \sim_G X_k$.

2. $X_{ij} \cap X_{jk} = \emptyset$. As shown in Fig. 9(a), we have $X_{jk} \subseteq X_j \backslash X_{ij}$, so $Y \perp_G X_{jk} \mid X_{ij}$. We also have $X_{ij} \subseteq X_j \backslash X_{jk}$, so $Y \perp_G X_{ij} \mid X_{jk}$. In the following, we show that any path between $Y$ and $X_{jk}$ contains at least one collider. We prove this by contradiction: assume there is a path $p_0: \langle Y, X_1, X_2, \ldots, X_m \rangle$ between $Y$ and $X_{jk}$ ($X_m \in X_{jk}$) on which every vertex is a non-collider. Because $Y \perp_G X_{jk} \mid X_{ij}$, $\exists X_i$, $i \leq m-1$, on $p_0$ such that $X_i \in X_{ij}$. So there is a path $p_1: \langle Y, X_1, X_2, \ldots, X_i \rangle$ between $Y$ and $X_{ij}$ on which every vertex is a non-collider. Because $Y \perp_G X_{ij} \mid X_{jk}$, again $\exists X_l$, $l \leq i-1$, on $p_1$ such that $X_l \in X_{jk}$. Iterating like this, we have either $X_1 \in X_{ij}$ or $X_1 \in X_{jk}$.
Because $X_1 \in Neig(Y)$, $Y \not\perp_G X_1$ given any subset, which contradicts $Y \perp_G X_j \backslash X_{jk} \mid X_{jk}$ and $Y \perp_G X_j \backslash X_{ij} \mid X_{ij}$. Because any path between $Y$ and $X_{jk}$ (and similarly $X_{ij}$) contains at least one collider, we have $Y \perp_G X_{ij} \mid \emptyset$ and $Y \perp_G X_{jk} \mid \emptyset$.

In the following, we show that any path between $Y$ and $X_i \backslash X_{ij}$ contains at least one collider. We prove this by contradiction: assume there is a path $p_0: \langle Y, X_1, X_2, \ldots, X_m \rangle$ between $Y$ and $X_i \backslash X_{ij}$ ($X_m \in X_i \backslash X_{ij}$) on which every vertex is a non-collider. Because $Y \perp_G X_i \backslash X_{ij} \mid X_{ij}$, $\exists X_i$, $i \leq m-1$, on $p_0$ such that $X_i \in X_{ij}$. So there is a path $p_1: \langle Y, X_1, X_2, \ldots, X_i \rangle$ between $Y$ and $X_{ij}$ on which every vertex is a non-collider, which contradicts $Y \perp_G X_{ij} \mid \emptyset$. Because any path between $Y$ and $X_i \backslash X_{ij}$ contains at least one collider, we have $Y \perp_G X_i \backslash X_{ij} \mid \emptyset$, and similarly $Y \perp_G X_k \backslash X_{jk} \mid \emptyset$. Considering $Y \perp_G X_{ij} \mid \emptyset$ and $Y \perp_G X_{jk} \mid \emptyset$, we now have $Y \perp_G X_i \mid \emptyset$ and $Y \perp_G X_k \mid \emptyset$. Because $\emptyset \subseteq X_i \cap X_k$ and $Y \perp_G X_i \cup X_k \backslash \emptyset \mid \emptyset$, we have $X_i \sim_G X_k$.

Under review as a conference paper at ICLR 2023

Figure 9: Illustration of the sets $X_{ij}$, $X_{jk}$, $X_{ij} \cap X_{jk}$, $X_{ij} \cup X_{jk}$, and $X_i \cup X_j \cup X_k$ in cases (a), (b), and (c).

3. $X_{ij} \cap X_{jk} \neq \emptyset$. We first define $A := X_{ij} \backslash (X_{ij} \cap X_{jk})$ and $B := X_{jk} \backslash (X_{ij} \cap X_{jk})$. In the following, we show that any path between $Y$ and $A$ can be blocked by $X_{ij} \cap X_{jk}$. We prove this by contradiction: assume there is a path $p_0: \langle Y, X_1, X_2, \ldots, X_m \rangle$ between $Y$ and $A$ such that $p_0$ cannot be blocked by $X_{ij} \cap X_{jk}$. Because $A \subseteq X_j \backslash X_{jk}$, $p_0$ can be blocked by $X_{jk}$. By Lemma C.4, $X_{jk} \backslash (X_{ij} \cap X_{jk})$ (i.e., the subset $B$ in Fig. 9(b)) must contain a non-collider $X_i$, $i \leq m-1$, on $p_0$. So there is a path $p_1: \langle Y, X_1, X_2, \ldots, X_i \rangle$ between $Y$ and $X_{jk} \backslash (X_{ij} \cap X_{jk})$. By Lemma C.4, as $p_1$ is a sub-path of $p_0$, $p_1$ cannot be blocked by $X_{ij} \cap X_{jk}$ either.
Because $X_i \in B \subseteq X_j \backslash X_{ij}$, $p_1$ can be blocked by $X_{ij}$, so again $X_{ij} \backslash (X_{ij} \cap X_{jk})$ (i.e., the subset $A$ in Fig. 9(b)) must contain a non-collider on $p_1$. Iterating like this, we have $X_1 \in A$ or $X_1 \in B$. Because $X_1$ is adjacent to $Y$, either $Y \perp_G X_j \backslash X_{jk} \mid X_{jk}$ or $Y \perp_G X_j \backslash X_{ij} \mid X_{ij}$ fails to hold. This contradiction means that any path between $Y$ and $A$ (similarly $B$) can be blocked by $X_{ij} \cap X_{jk}$. Formally speaking, $Y \perp_G X_{ij} \cup X_{jk} \backslash (X_{ij} \cap X_{jk}) \mid X_{ij} \cap X_{jk}$.

Define $C := X_{ij} \cup X_{jk} \backslash (X_{ij} \cap X_{jk})$ and $D := X_i \cup X_j \cup X_k \backslash (X_{ij} \cup X_{jk})$. We have already shown that any path between $Y$ and $C$ can be blocked by $X_{ij} \cap X_{jk}$; in the following, we show that any path between $Y$ and $D$ can also be blocked by $X_{ij} \cap X_{jk}$. Again, we prove this by contradiction: assume there is a path $p_0: \langle Y, X_1, X_2, \ldots, X_m \rangle$ between $Y$ and $D$ such that $p_0$ cannot be blocked by $X_{ij} \cap X_{jk}$. Because either $X_m \in X_i \cup X_j \backslash X_{ij}$ or $X_m \in X_j \cup X_k \backslash X_{jk}$, $p_0$ can be blocked by $X_{ij}$ or by $X_{jk}$. Let us assume $p_0$ can be blocked by $X_{ij}$ (the case of $X_{jk}$ admits a similar analysis). By Lemma C.4, $X_{ij} \backslash (X_{ij} \cap X_{jk})$ must contain a non-collider $X_i$ on $p_0$. So we have a path $p_1: \langle Y, X_1, X_2, \ldots, X_i \rangle$ between $Y$ and $X_{ij} \backslash (X_{ij} \cap X_{jk})$. Because $p_1$ is a sub-path of $p_0$, by Lemma C.4, $p_1$ cannot be blocked by $X_{ij} \cap X_{jk}$ either, which contradicts $Y \perp_G X_{ij} \cup X_{jk} \backslash (X_{ij} \cap X_{jk}) \mid X_{ij} \cap X_{jk}$. This contradiction means that any path between $Y$ and $D$ can also be blocked by $X_{ij} \cap X_{jk}$.

In conclusion, any path between $Y$ and $X_i \cup X_j \cup X_k \backslash (X_{ij} \cap X_{jk})$ can be blocked by $X_{ij} \cap X_{jk}$. Formally speaking, $Y \perp_G X_i \cup X_j \cup X_k \backslash (X_{ij} \cap X_{jk}) \mid X_{ij} \cap X_{jk}$. Because $X_i \cup X_k \subseteq X_i \cup X_j \cup X_k$ and $X_{ij} \cap X_{jk} \subseteq X_i \cap X_k$, we have $X_i \sim_G X_k$.

To conclude, we have shown that $\sim_G$ satisfies reflexivity, symmetry, and transitivity, so it is also a valid equivalence relation. Recall that the Markovian assumption states that for any disjoint sets $X_i, X_j, X_k$, we have $X_i \perp_G X_j \mid X_k \Rightarrow X_i \perp X_j \mid X_k$.
It builds a bridge from d-separation in the graph to conditional independence in probability. As a result, under this assumption, we can infer that two subsets are equivalent in predicting $Y$ if they are graphically equivalent in the intervened graph $G_{X_M}$. Formally speaking,

Proposition C.5. For two subsets $X_i$ and $X_j$ of the stable set, if $X_i \sim_G X_j$, then $X_i \sim_P X_j$.

Remark C.6. Note that the reverse claim $X_i \sim_P X_j \Rightarrow X_i \sim_G X_j$ is not true even under the faithfulness assumption. Consider the counterexample of $Y, X_1, X_2$ with structural equations $X_1 \leftarrow Y + N(0, 1)$, $X_2 \leftarrow Y + N(0, 1)$. We have $\{X_1\} \sim_P \{X_2\}$, but not $\{X_1\} \sim_G \{X_2\}$.

Now that we know the definition of two subsets being equivalent and how to infer the equivalence from the causal graph, we are ready to introduce the notion of a g-equivalent class. Denote the power set of the stable variables $X_S$ as $Pow(X_S)$; the elements of the quotient space $Pow(X_S)/\sim_G$ are called g-equivalent classes. Since all predictors in the same g-equivalent class have the same power in predicting $Y$, the search for the optimal predictor in Alg. 1 should be conducted among g-equivalent classes. In the following, we introduce an algorithm to recover the $Pow(X_S)/\sim_G$ space. The algorithm takes the stable graph $G_S$ as input and recursively explores the stable variables in the order of their distance to $Y$. In each step of the exploration, it creates subgraphs to represent conditional independencies after including/excluding some stable subsets. We use the maximal ancestral graph (MAG) to construct these subgraphs, thanks to its ability to preserve conditional independence in the presence of included (selection) or excluded (latent) variables. In the following, we omit the subscript $S$ in $G_S$ and $X_S$ for brevity.

Algorithm 6: $P_G = \text{Recover}(G)$
Input: a causal graph $G$. Output: the set of all g-equivalent classes $P_G$.
1: Let $X$ be the covariate set of $G$
2: Find $Neig_G(Y)$
3: if $Neig_G(Y) = \emptyset$ then
4:   return $\{Pow(X)\}$
5: else
6:   $P_G \leftarrow \{\}$
7:   for $T$ in $Pow(Neig_G(Y))$ do
8:     $S \leftarrow T$, $L \leftarrow Neig_G(Y) \backslash T$, $O \leftarrow X \backslash Neig_G(Y)$
9:     $G' \leftarrow \text{MAG}(G, O, L, S)$
10:    $P_{G'} \leftarrow \text{Recover}(G')$
11:    for $[X_i]$ in $P_{G'}$ do
12:      for $X_j$ in $[X_i]$ do
13:        $X_j \leftarrow X_j \cup T$
14:      end for
15:    end for
16:    $P_G \leftarrow P_G \cup P_{G'}$
17:  end for
18:  return $P_G$
19: end if

Algorithm 7: $G' = \text{MAG}(G, O, L, S)$
Input: a causal graph $G$ over $X = O \cup L \cup S$. Output: a causal graph $G'$ over $O$.
1: For each pair of variables $A, B \in O$: $A$ and $B$ are adjacent in $G'$ if and only if there is an inducing path relative to $\langle L, S \rangle$ between them in $G$.
2: For each pair of adjacent vertices $A, B$ in $G'$, orient the edge between them as follows:
3: (a) orient it as $A \to B$ in $G'$ if $A \in Anc_G(B \cup S)$ and $B \notin Anc_G(A \cup S)$;
4: (b) orient it as $B \to A$ in $G'$ if $A \notin Anc_G(B \cup S)$ and $B \in Anc_G(A \cup S)$;
5: (c) orient it as $A \leftrightarrow B$ in $G'$ if $A \notin Anc_G(B \cup S)$ and $B \notin Anc_G(A \cup S)$;
6: (d) orient it as $A - B$ in $G'$ if $A \in Anc_G(B \cup S)$ and $B \in Anc_G(A \cup S)$.

Proposition C.7. Alg. 6 outputs the correct g-equivalent classes of the causal graph $G$.

In practice, the recovered graph may be a PDAG rather than a DAG. The following proposition, which states that all Markov-equivalent graphs have the same g-equivalent classes, ensures that our recovery algorithm can also be applied to a PDAG.

Proposition C.9. Under Assumption 2.2, causal graphs in the same Markov equivalence class have the same g-equivalence classes.

Proof. By definition, causal graphs in the same Markov equivalence class entail the same probability distribution. Under the Markovian and faithfulness assumptions, this means they have the same set of d-separations. As g-equivalence is defined on d-separations, they also have the same g-equivalence classes.
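Definition C.3 can also be checked mechanically: test $Y \perp (X_i \cup X_j)\backslash X_{ij} \mid X_{ij}$ for every $X_{ij} \subseteq X_i \cap X_j$, deciding d-separation via the standard ancestral moral-graph reduction. The sketch below (helper names are our own; $X_M$ is taken as $\emptyset$ for simplicity) illustrates this on the chain $Y \to X_1 \to X_2$, where $\{X_1\} \sim_G \{X_1, X_2\}$ but $\{X_1\} \not\sim_G \{X_2\}$:

```python
from itertools import chain, combinations

def d_separated(parents, y, A, B):
    """Is y d-separated from every node in A given B? Uses the ancestral
    moral-graph criterion: restrict to ancestors of {y} | A | B,
    moralize, delete B, then test reachability from y."""
    anc = {y} | set(A) | set(B)
    stack = list(anc)
    while stack:                      # ancestral closure
        for p in parents.get(stack.pop(), []):
            if p not in anc:
                anc.add(p); stack.append(p)
    adj = {v: set() for v in anc}     # moralized undirected graph
    for v in anc:
        ps = [p for p in parents.get(v, []) if p in anc]
        for p in ps:
            adj[v].add(p); adj[p].add(v)
        for a in ps:                  # marry parents of a common child
            for b in ps:
                if a != b:
                    adj[a].add(b)
    seen, stack = {y}, [y]
    while stack:                      # reachability avoiding B
        for w in adj[stack.pop()]:
            if w not in seen and w not in set(B):
                seen.add(w); stack.append(w)
    return not (set(A) & seen)

def g_equivalent(parents, y, Xi, Xj):
    """Definition C.3 with X_M = {} (an assumed simplification)."""
    inter = sorted(set(Xi) & set(Xj))
    union = set(Xi) | set(Xj)
    subsets = chain.from_iterable(
        combinations(inter, r) for r in range(len(inter) + 1))
    return any(d_separated(parents, y, union - set(s), set(s))
               for s in subsets)

# Chain Y -> X1 -> X2: conditioning on X1 screens X2 off from Y.
pa = {'X1': ['Y'], 'X2': ['X1']}
assert g_equivalent(pa, 'Y', {'X1'}, {'X1', 'X2'})   # equivalent
assert not g_equivalent(pa, 'Y', {'X1'}, {'X2'})     # not equivalent
```

The brute-force subset loop is exponential in $|X_i \cap X_j|$; Alg. 6 avoids exactly this blow-up by recursing over neighbors of $Y$.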

C.2 DETAILS OF CAUSAL DISCOVERY TO DETECT LOCAL COMPONENTS

In this section, we summarize our method to detect $X^0_M$, $De(X^0_M)$, $X_M$, $De(X_M)$, $Blanket(Y)$, $\{PA(X_i)\}_{X_i \in X_M \cup De(X_M)}$, and $PC(Y) := PA(Y) \cup Ch(Y)$, where $Blanket(Y)$ denotes the Markov blanket of $Y$. The identification of $X^0_M \cup De(X^0_M)$ is given in Alg. 3. To distinguish $X^0_M$ from $De(X^0_M)$, it suffices to identify the direction of $X_i - X_j$ when both $X_i$ and $X_j$ are in $X^0_M \cup De(X^0_M)$, which can be accomplished by comparing $\Delta_{X_i \to X_j}$ and $\Delta_{X_j \to X_i}$ (see Huang et al. (2020) for details). However, it should be noted that distinguishing $X^0_M$ from $De(X^0_M)$ is unnecessary for the estimation of $h(S_-, J)$ and $f_{S_-}$. The identification of $X_M$ and $PC(Y)$ is given in Alg. 2, where $PC(Y)$ can be obtained from the undirected skeleton. $Blanket(Y)$ can be identified following Aliferis et al. (2003). The identification of $X_M \cup De(X_M)$ is given in Alg. 4, and we can distinguish $X_M$ from $De(X_M)$ in the same way as for $\{X^0_M, De(X^0_M)\}$. The parents $\{PA(X_i) \mid X_i \in X_M \cup De(X_M)\}$ can be identified by Alg. 5.

C.3 DETAILS OF ESTIMATING f S-

To estimate $f_{S_-}$, we adopt a soft intervention to replace $P^e(X_M | PA(X_M))$ with $P(X_M)$, and hence define $p'(x, y) = p(y|pa(y)) \prod_{i \in S} p(x_i|pa(x_i))\, p(x_M)$. Then we have $f_{S_-} = E_{P'}[Y | x_{S_-}, x_M]$. To generate data from $P'$, we first permute $X_M$ in a sample-wise manner to generate data from $P(X_M)$. We then regenerate data for $X_M$'s descendants in the intervened graph via estimated structural equations, as summarized in Alg. 8.

Algorithm 8: Estimation of $f_{S_-}$.
INPUT: training data $\{x^{(k)}, y^{(k)}\}_{k=1}^n$, $S_- \subset S$, $X_M$, $De(X_M)$, and $\{PA(X_i)\}_{X_i \in De_{G_{X_M}}(X_M)}$.
OUTPUT: trained $f_{S_-}$.
1: Shuffle $\{(x_M)^{(k)}\}_{k=1}^n$ by randomizing the indices.
2: for $X_i \in De_{G_{X_M}}(X_M)$ do
3:   Regenerate $\{(x_i)^{(k)}\}_{k=1}^n$ as $\{g_i(pa(x_i)^{(k)})\}_{k=1}^n$.
4: end for
5: Train $f_{S_-}$ over the regenerated samples.

Indeed, we only need to regenerate $De_{G_{X_M}}(X_M) \cap Blanket(Y)$, since $p'(y|blanket(y)) = p'(y|x)$. To maximally reduce the approximation error in regeneration, we consider intervening on another variable set $X^*_{do} := X^0_M \cup (De(X^0_M) \backslash Ch(Y))$ and regenerating the variables in $De_{G_{X^*_{do}}}(X^*_{do})$. We prove that $De_{G_{X^*_{do}}}(X^*_{do})$ is the minimum regeneration set in the following proposition.

Proposition C.10. Denote $X^*_{do} := X^0_M \cup (De(X^0_M) \backslash Ch(Y))$. Then:
1. For any admissible set $X_{do}$, we have $De_{G_{X_{do}}}(X_{do}) \cap Blanket(Y) \supset De_{G_{X^*_{do}}}(X^*_{do})$;
2. $X^*_{do}$, $De_{G_{X^*_{do}}}(X^*_{do})$, and $\{PA(X_i)\}_{X_i \in De_{G_{X^*_{do}}}(X^*_{do})}$ are identifiable.

Proof. (1) First, we prove that a set of variables $X_{do}$ being admissible, i.e., $p_{do}(y|x) = p(y|x_S, do(x_M))$, is equivalent to $\{X_M \cap Ch(Y)\} \subset X_{do}$ and $\{X_S \cap Ch(Y)\} \cap X_{do} = \emptyset$. Note that
$$p(y|x_S, do(x_M)) = \frac{p(y|pa(y)) \prod_{X_i \in X_S \cap Ch(Y)} p(x_i|pa(x_i))}{\int_y p(y|pa(y)) \prod_{X_i \in X_S \cap Ch(Y)} p(x_i|pa(x_i))\, dy},$$
$$p_{do}(y|x) = \frac{p(y|pa(y)) \prod_{X_i \in \{X \backslash X_{do}\} \cap Ch(Y)} p(x_i|pa(x_i))}{\int_y p(y|pa(y)) \prod_{X_i \in \{X \backslash X_{do}\} \cap Ch(Y)} p(x_i|pa(x_i))\, dy}.$$
It can be seen that p_do(y|x) = p(y|x_S, do(X_M)) ⇔ {X \ X_do} ∩ Ch(Y) = X_S ∩ Ch(Y), which can be rewritten as {X_M ∩ Ch(Y) ∩ X_do^C} ∪ {X_S ∩ Ch(Y) ∩ X_do^C} = X_S ∩ Ch(Y). The above equation holds if and only if {X_M ∩ Ch(Y)} ⊆ X_do and {X_S ∩ Ch(Y)} ∩ X_do = ∅.

(2) Second, we prove that X*_do is an admissible set and De_{G_{X*_do}}(X*_do) = De(X_M ∩ Ch(Y)) ∩ X_S ∩ Ch(Y). To simplify the notation, let X_0 := X*_do and X_1 := De_{G_{X*_do}}(X*_do). The conditions {X_M ∩ Ch(Y)} ⊆ X_0 and {X_S ∩ Ch(Y)} ∩ X_0 = ∅ hold by definition.

(2.1) We show De_{G_{X_0}}(X_0) ⊆ X_1. Noting X_0 ⊆ {X_M ∩ Ch(Y)} ∪ {De(X_M ∩ Ch(Y))}, we have De(X_0) ⊆ De(X_M ∩ Ch(Y)). Besides, since De_{G_{X_0}}(X_0) = De(X_0) \ X_0, we have

De_{G_{X_0}}(X_0) = De(X_0) ∩ X_0^C
= De(X_0) ∩ {X_M ∩ Ch(Y)}^C ∩ {De(X_M ∩ Ch(Y)) \ Ch(Y)}^C
= De(X_0) ∩ {X_M^C ∪ Ch(Y)^C} ∩ {De(X_M ∩ Ch(Y))^C ∪ Ch(Y)}
⊆ {De(X_M ∩ Ch(Y))} ∩ {X_M^C ∪ Ch(Y)^C} ∩ {De(X_M ∩ Ch(Y))^C ∪ Ch(Y)}
= De(X_M ∩ Ch(Y)) ∩ X_M^C ∩ Ch(Y)
= De(X_M ∩ Ch(Y)) ∩ X_S ∩ Ch(Y) ⊆ X_1.

(2.2) We show X_1 ⊆ De_{G_{X_0}}(X_0). Since X_M ∩ Ch(Y) ⊆ X_0, we have De(X_M ∩ Ch(Y)) ⊆ De(X_0), so X_1 ⊆ De(X_M ∩ Ch(Y)) ⊆ De(X_0), and hence X_1 \ X_0 ⊆ De(X_0) \ X_0. Besides, note that X_0 ∩ X_1 = ∅, so that X_1 \ X_0 = X_1 and De_{G_{X_0}}(X_0) = De(X_0) \ X_0; we conclude X_1 ⊆ De_{G_{X_0}}(X_0).

(3) Given X_do satisfying the two conditions, we have X_M ∩ Ch(Y) ⊆ X_do ⇒ De(X_M ∩ Ch(Y)) ⊆ De(X_do), and X_do ⊆ {X_S ∩ Ch(Y)}^C ⇒ {X_S ∩ Ch(Y)} ⊆ X_do^C. Therefore, De(X_M ∩ Ch(Y)) ∩ {X_S ∩ Ch(Y)} ⊆ De(X_do) ∩ X_do^C. Thus X_1 ⊆ De_{G_{X_do}}(X_do) for any X_do satisfying X_M ∩ Ch(Y) ⊆ X_do and X_do ∩ {X_S ∩ Ch(Y)} = ∅.

(4) The identifiability of X*_do, De_{G_{X*_do}}(X*_do), and {PA(X_i)}_{X_i ∈ De_{G_{X*_do}}(X*_do)} readily follows from Sec. C.2.
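The regeneration procedure of Alg. 8 can be sketched numerically. The toy SCM below (variable names, mechanisms, and the linear least-squares estimator are all illustrative assumptions, not the paper's setup) has a stable parent X_s of Y, a mutable X_m depending on X_s, and a common child X_d of (X_m, Y), so only X_d needs regeneration: since Y's own mechanism is untouched, the observed y paired with the permuted x_m and regenerated x_d is a valid draw from the soft-interventional P'.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000

# Toy SCM (an assumption for illustration): X_s -> Y, X_s -> X_m, (X_m, Y) -> X_d.
x_s = rng.normal(size=n)
x_m = 0.8 * x_s + rng.normal(size=n)       # mutable mechanism, to be removed
y = 2.0 * x_s + 0.1 * rng.normal(size=n)   # Y's mechanism only involves X_s
x_d = x_m + y + 0.1 * rng.normal(size=n)   # descendant of X_M inside Blanket(Y)

# Alg. 8, line 1: permute X_M sample-wise, i.e. draw from the marginal P(X_M).
x_m_perm = rng.permutation(x_m)

# Alg. 8, lines 2-3: regenerate descendants of X_M from a fitted structural
# equation g_d, estimated here by least squares on PA(X_d) = (X_m, Y).
A = np.column_stack([x_m, y, np.ones(n)])
g_d, *_ = np.linalg.lstsq(A, x_d, rcond=None)
x_d_regen = np.column_stack([x_m_perm, y, np.ones(n)]) @ g_d

# Alg. 8, line 4: train f_{S^-} on the regenerated sample (linear f here).
F = np.column_stack([x_s, x_m_perm, x_d_regen, np.ones(n)])
coef, *_ = np.linalg.lstsq(F, y, rcond=None)
```

On this toy graph, the fitted predictor exploits the regenerated collider X_d (positive weight) together with X_m (negative weight), exactly as conditioning on a child of Y would suggest.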

D APPENDIX FOR SEC. 3.3: COMPLEXITY ANALYSIS

In this section, we give some graphical examples and derive the number of g-equivalent classes (denoted by N_G) for each.

Lemma D.1 (Adding/Deleting Edges). In a causal graph G, adding edges does not decrease N_G, and deleting edges does not increase N_G.

Proof. 1. For any causal graph G_0, add an edge to it and call the resulting graph G_1. We show N_{G_0} ≤ N_{G_1} by proving that for any subsets X_i, X_j, if X_i ≁_{G_0} X_j, then X_i ≁_{G_1} X_j. Suppose, for contradiction, that there are X_i, X_j such that X_i ≁_{G_0} X_j and X_i ∼_{G_1} X_j. By X_i ∼_{G_1} X_j, there exists X_ij ⊆ X_i ∩ X_j such that Y ⊥_{G_1} X_i ∪ X_j \ X_ij | X_ij. Because adding an edge does not change the covariate sets, we also have X_ij ⊆ X_i ∩ X_j in G_0. Because X_i ≁_{G_0} X_j, we have Y ⊥̸_{G_0} X_i ∪ X_j \ X_ij | X_ij; in other words, there is a path p in G_0 between Y and X_i ∪ X_j \ X_ij that cannot be blocked by X_ij. This means X_ij contains no non-collider on p, and X_ij contains every collider (or one of its descendants) on p in G_0. In G_1, p is still a path between Y and X_i ∪ X_j \ X_ij. Besides, any collider X_c on p in G_0 is still a collider on p in G_1; any variable X_d ∈ De(X_c), where X_c is a collider on p in G_0, is still a descendant of that collider in G_1; and any non-collider X_n on p in G_0 is still a non-collider on p in G_1. Hence p cannot be blocked by X_ij in G_1 either, which contradicts the claim that Y ⊥_{G_1} X_i ∪ X_j \ X_ij | X_ij.

2. For any causal graph G_1, delete an edge and call the resulting graph G_0. It is straightforward to show N_{G_1} ≥ N_{G_0} using the conclusion of part 1.

Example 2 (Chain). A chain graph is a graph whose skeleton is a chain, i.e., Y − X_n − X_{n−1} − ... − X_1. For any chain graph with n covariates, N_{G_n} = n + 1.

Proof. We prove the claim by induction. Base. When n = 1, N_{G_1} = 2 = n + 1 holds. Induction hypothesis. Suppose that for chain graphs with n covariates, N_{G_n} = n + 1. Step.
When there are n + 1 covariates in the chain graph, i.e., G_{n+1} has skeleton Y − X_{n+1} − X_n − ... − X_1: as Neig_{G_{n+1}}(Y) = {X_{n+1}}, we count the g-equivalent classes when including and when excluding X_{n+1}. If X_{n+1} is a collider, then when including X_{n+1}, the induced MAG G' has skeleton Y − X_n − X_{n−1} − ... − X_1; when excluding X_{n+1}, the skeleton of the induced MAG becomes Y, X_n − X_{n−1} − ... − X_1 (with Y disconnected). By the induction hypothesis, N_G = (n + 1) + 1 = n + 2 holds. A similar conclusion holds if X_{n+1} is a non-collider.

Example 3 (Star). A star graph with k branches is a graph whose skeleton is composed of k disjoint chains. For any star graph with k branches and n covariates, N_{G_n} = O(n^k).

Proof. A star graph is composed of k disjoint chain graphs, each containing n/k covariates. So N_G = (n/k + 1)^k = O(n^k).

Example 4 (Circle). A circle graph is a graph whose skeleton is a circle, i.e., Y − X_n − X_{n−1} − ... − X_1, Y − X_1. For any circle graph with n covariates, N_{G_n} = O(n^2).

Proof. If X_n is a collider, then when including X_n, the induced MAG has skeleton Y − X_{n−1} − ... − X_1, Y − X_1, which is again a circle with n − 1 covariates; when excluding X_n, the skeleton of the induced MAG becomes a chain with n − 1 covariates, Y − X_1 − X_2 − ... − X_{n−1}. So we have N_{G_n} = n + N_{G'_{n−1}}, i.e., the increments {N_{G_n} − N_{G_{n−1}}}_n form an arithmetic sequence in n. By the summation formula for arithmetic sequences, N_{G_n} = O(n^2).

Example 5 (Knots). A knot graph (shown in Fig. 11 (a)) is a generalized directed chain graph in which each knot contains 4 covariates. For a knot graph with n covariates, we have N_{G_n} = O(c^n) for some constant 1 < c < 2.

Proof. We prove the claim by deriving the recursion of N_G w.r.t. the knot number k. For the knot graph shown in Fig. 11 (a), the only neighbour of Y is X_1.
As X_1 is a non-collider, when including X_1, Y will not have any neighbour in the induced MAG, so the number of g-equivalent classes in this sub-graph is 1. When X_1 is excluded, the induced MAG is shown in Fig. 11 (b), where Y is adjacent to the three covariates X_2, X_3, X_4. As a result, we need to consider 2^3 = 8 combinations of including/excluding these covariates. Out of the 8 combinations, 4 of them (those including X_4) induce MAGs where Y has no neighbours, while the other 4 (those excluding X_4) induce the MAG shown in Fig. 11 (c), which is again a knot graph with k − 1 knots. So the recursion of N_G w.r.t. the knot number k is N_{G_k} = 1 + (4 + 4 · N_{G_{k−1}}). This means N_G increases exponentially w.r.t. the knot number k; because k = n/4, N_G also increases exponentially w.r.t. n.

Example 6 (Lollipop). A lollipop graph consists of a "sugar" part with m covariates X_1, ..., X_m and a "stick" part, i.e., a chain X_{m+1} − ... − X_n attached to it (see Fig. 12). For any lollipop with n covariates and constant m, N_{G_n} = O(n).

Proof. Note that the covariates X_1, X_2, ..., X_m in the sugar part may play different roles (non-collider or collider) on different paths through them, so it can be troublesome to analyze Y's neighbourhood as we did for the chain graph. Fortunately, each covariate in the stick belongs to only one path, so we can use them to construct an upper bound on N_{G_n}. Formally, we construct the upper bound with Corollary C.8: when X_{m+1} is blocked, the number of g-equivalent classes in the induced MAG is less than 2^m; when X_{m+1} is open, the induced MAG G' is again a lollipop with n − 1 covariates, as shown in Fig. 12 (c). So we have the inequality N_{G_n} ≤ 2^m + N_{G'_{n−1}}. Recursively performing the analysis on G'_{n−1}, we have N_{G_n} ≤ 2^m + 2^m + N_{G''_{n−2}} ≤ ... ≤ 2^m + 2^m + ... + 2^m = 2^m (n − m + 1). As m is a constant, N_{G_n} = O(n).

Example 7 (k-lollipop). A lollipop with k sticks is called a k-lollipop. For any k-lollipop with n covariates, N_{G_n} = O(n^k).

Proof. Let us first look at the number of g-equivalent classes in a 2-lollipop, as shown in Fig. 13 (a). Following Example 6, we construct an upper bound on N_{G_n} by applying Corollary C.8 to {A_{m+1}, B_{m+1}}. There are 2^2 = 4 situations: (i) when A_{m+1} and B_{m+1} are both blocked, the number of g-equivalent classes in the induced MAG is less than 2^m; (ii, iii) when one of {A_{m+1}, B_{m+1}} is blocked and the other is open, the induced MAG is eventually a 1-lollipop, as shown in Fig. 13 (b), so the number of equivalent classes in this situation is bounded by O(n/2); (iv) when both A_{m+1} and B_{m+1} are open, the induced MAG G' is a 2-lollipop with n − 2 covariates, as shown in Fig. 13 (c). To conclude, we have the inequality N_{G_n} ≤ 2^m + O(n/2) + N_{G'_{n−2}}. Recursively performing the analysis on G'_{n−2}, we have N_{G_n} ≤ 2^m + O(n/2) + N_{G'_{n−2}} ≤ 2^m + 2^m + O(n/2) + O((n−2)/2) + N_{G''_{n−4}} ≤ ... ≤ 2^m + ... + 2^m + O(n/2) + O((n−2)/2) + ... + O(1) = O(2^m n) + O(n^2). As m is a constant, N_{G_n} = O(n^2) for the 2-lollipop.

Inspired by this observation, we analyze N_{G_n} in the k-lollipop by induction. Base. For the 2-lollipop, N_{G_n} = O(n^2) holds. Induction hypothesis. For the k-lollipop, N_{G_n} = O(n^k) holds. Step. For the (k+1)-lollipop, we construct an upper bound on N_{G_n} by applying Corollary C.8 to {X^1_{m+1}, ..., X^{k+1}_{m+1}}, where X^i_{m+1} is the left-most covariate on the i-th stick. There are 2^{k+1} situations: (i) when at least one of X^1_{m+1}, ..., X^{k+1}_{m+1} is blocked, the induced MAG is a lollipop with at most k sticks; by the induction hypothesis, the number of g-equivalent classes in these sub-graphs is at most O(n^k). (ii) When all of X^1_{m+1}, ..., X^{k+1}_{m+1} are open, the induced G' is eventually a (k+1)-lollipop with n − k covariates. So we have the inequality N_{G_n} ≤ O(n^k) + N_{G'_{n−k}}. Recursively performing the analysis on G'_{n−k}, we have N_{G_n} ≤ O(n^k) + N_{G'_{n−k}} ≤ O(n^k) + O((n − k)^k) + ... + O(1) = O(n^{k+1}).

We now prove the skip-chain bound N_{G_{m,n}} = O(n^{2m}) stated in Example 8 (see Fig. 14).

Proof. Let us first look at the skip chain graph with m = 1, an example of which is shown in Fig. 14 (a); it is constructed by adding a skip connection between Y and X_{i_1}. Following Example 6, we construct an upper bound on N_{G_{1,n}} by applying Corollary C.8 to X_{i_1−1}. When X_{i_1−1} is blocked, the induced MAG is a two-branch star graph, as shown in Fig. 14 (b), so the number of g-equivalent classes in this sub-graph is O(n^2). When X_{i_1−1} is open, the induced MAG G' is eventually a skip chain with m = 1 and n − 1 covariates, as shown in Fig. 14 (c). So we have the inequality N_{G_{1,n}} ≤ O(n^2) + N_{G'_{1,n−1}}. Recursively performing this analysis, we have N_{G_{1,n}} ≤ O(n^2) + N_{G'_{1,n−1}} ≤ ... ≤ O(n^2) + O((n − 1)^2) + ... + O(1) = O(n^3).

Now that we have analyzed the skip chain graph with m = 1, let us look at the situation with m = 2, an example of which is shown in Fig. 15 (a). First, we can construct an upper bound for any skip chain graph with m = 2 by adding extra connections among Y, X_{i_1}, X_{i_2} until these three vertices form a complete connection, as shown in Fig. 15 (b). Then we again apply Corollary C.8 to X_{i_1−1}. When X_{i_1−1} is blocked, the induced MAG contains two disjoint branches, one of which is a chain and the other eventually a skip chain graph with m = 1, so the number of g-equivalent classes in this sub-graph is O(n^3). When X_{i_1−1} is open, the induced MAG is a skip chain with m = 2 and n − 1 covariates. So, similarly to the inequality for m = 1, we have N_{G_{2,n}} = O(n^4).

Following the spirit of the above analysis, we show N_{G_{m,n}} = O(n^{2m}) by induction. Base. When m = 1, N_{G_{1,n}} = O(n^2) holds. Induction hypothesis. Suppose that for any skip chain graph with parameter m, we have N_{G_{m,n}} = O(n^{2m}). Step. For a skip chain graph with skip connections among Y and X_{i_1}, ..., X_{i_{m+1}}, first construct an upper bound by adding extra connections among Y and X_{i_1}, ..., X_{i_{m+1}} until these vertices form a complete connection. Then, in the resulting graph, apply Corollary C.8 to X_{i_1−1}. When X_{i_1−1} is blocked, the induced MAG contains two disjoint branches, one of which is a chain and the other eventually a skip chain graph with parameter at most m; so the number of g-equivalent classes in this sub-graph is O(n^{2m+1}). When X_{i_1−1} is open, the induced MAG is a skip chain with parameter m + 1 and n − 1 covariates. So, similarly to the inequality for m = 1, we have N_{G_{m,n}} = O(n^{2m+2}).
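The counting recursions of Examples 2, 4, and 5 can be tabulated in code. This is a hedged demo only: the base-case values below are assumptions chosen for the sketch, and the functions evaluate the recursions rather than enumerating g-equivalent classes directly.

```python
def chain(n):
    # Example 2: a chain with n covariates has N_G = n + 1.
    return n + 1

def circle(n):
    # Example 4: N_{G_n} = n + N_{G_{n-1}}; the chain value serves as the
    # (assumed) base case, and the arithmetic increments give O(n^2) growth.
    return chain(1) if n == 1 else n + circle(n - 1)

def knots(k):
    # Example 5: N_{G_k} = 1 + (4 + 4 * N_{G_{k-1}}) for k knots, with an
    # assumed base value of 1; growth is exponential in k (and in n = 4k).
    return 1 if k == 0 else 5 + 4 * knots(k - 1)
```

Solving the knot recursion in closed form gives N_{G_k} = (8 · 4^k − 5)/3, i.e., c = 4^{1/4} ≈ 1.41 in the O(c^n) statement.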

Lemma D.2 (Property of Tree).

A tree is an undirected graph in which any two vertices are connected by exactly one path. If a tree has d_L leaves and d_{≥3} vertices of degree at least three, then d_L ≥ d_{≥3} + 2.

Proof. Denote the number of all vertices in the tree by d_T. By the handshaking lemma, d_L + 2(d_T − d_L − d_{≥3}) + 3 d_{≥3} ≤ Σ_{i=1}^{d_T} deg(V_i) = 2(d_T − 1), which indicates d_L ≥ d_{≥3} + 2.

Proposition D.3. The complexity of Alg. 6 is Θ(N_G).

Proof. Treat each call of the MAG(·) function in Alg. 7 as a unit operation. 1. In the recursion tree of Alg. 6, the number of all vertices d_T is the complexity of Alg. 6, while the number of leaves d_L is N_G.

2. Each internal vertex in the recursion
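Lemma D.2 can be checked empirically on random trees. The random-attachment construction below is an illustrative assumption (it is not part of Alg. 6); the assertion is exactly the lemma's inequality d_L ≥ d_{≥3} + 2.

```python
import random
from collections import Counter

def random_tree_degrees(num_vertices, seed):
    # Build a random tree by attaching each new vertex to a uniformly
    # random earlier vertex, and return the degree of every vertex.
    rng = random.Random(seed)
    deg = Counter()
    for v in range(1, num_vertices):
        parent = rng.randrange(v)
        deg[v] += 1
        deg[parent] += 1
    return deg

for seed in range(100):
    deg = random_tree_degrees(30, seed)
    d_leaves = sum(1 for v in deg if deg[v] == 1)
    d_high = sum(1 for v in deg if deg[v] >= 3)
    assert d_leaves >= d_high + 2   # Lemma D.2
```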

min_{α,β} max_θ (1/(2n)) Σ_{i=1}^n ( y_i − f_α(x_{i,S} ⊙ β, x_{i,M}) )^2 + λ ∥β∥_1.

Based on this, we denote A := supp(β*). In this regard, the optimization with respect to (α, β) is a Lasso with a general loss. Since our goal is to recover the optimal subset A and the predictor with (α*, β*), we are interested in the model selection consistency and ℓ2-consistency properties:

• Model selection consistency: lim_{n→∞} P(Â_n = A) = 1, where Â_n := supp(β̂_n).
• ℓ2-consistency: lim_n ∥ζ̂_n − ζ*∥_2^2 = 0, where ζ := (α^⊤, β^⊤)^⊤. Here we denote ζ̂_n := argmin_ζ L(ζ) + (λ_n/n) ∥β∥_1.

Model selection consistency ensures that we find the optimal subset, and ℓ2-consistency further guarantees the optimality of the learned predictor. In the following, we discuss two settings: i) the fixed-dimensional setting, where |S| = d is fixed; ii) the high-dimensional setting, where d increases with n. We first introduce some assumptions, which are commonly made in the Lasso literature Zhao & Yu (2006); Negahban et al. (2012); Rejchel (2016):

Assumption E.1 (Restricted Strong Convexity (RSC)). We assume that L is convex; L and Q := E_p[L] are twice differentiable and satisfy H := ∇²L(α*, β*) ⪰ γI and H̄ := ∇²Q(α*, β*) ⪰ γI for some γ > 0.

Assumption E.2 (Square-integrability of the gradient). We assume E|∂ℓ(α, β)|² < ∞ for each (α, β) in some neighborhood of (α*, β*).

Assumption E.3 (Irrepresentable condition). We assume that ∥ H_{A^c,(α,A)} H^†_{(α,A),(α,A)} (0^⊤, sign(β*_A)^⊤)^⊤ ∥_∞ < 1.

Remark E.4. The restricted strong convexity condition has been widely assumed in variable selection Negahban et al. (2012); Zhao & Yu (2006), especially in high-dimensional statistics, to ensure identifiability of the oracle parameter. The irrepresentable condition is almost necessary to recover the true signal set. The square-integrability condition is a regularity condition needed under general convex losses Niemiro (1992); Rejchel (2016) to ensure asymptotic normality. We are now ready to introduce our results, starting with model selection consistency in the fixed-dimensional setting.
Before that, we first introduce two lemmas from Rejchel (2016), in which

V(ζ) := (1/2) ζ^⊤ H ζ + Σ_{j∈A} ζ_j sign(ζ*_j) + Σ_{j∉A} |ζ_j|.

Lemma E.6 (Theorem 2.3 in Rejchel (2016)). Under the same conditions as in Lemma E.5, we have

sup_{|ζ − ζ*| ≤ M a_n} a_n^{−1} | ∂L(ζ)/∂ζ − ∂L(ζ*)/∂ζ − H(ζ − ζ*) | →_p 0.

With this lemma, we have the following model selection consistency result:

Theorem E.7. Under the same conditions as in Lemma E.5 and additionally Assumption E.3, we have lim_n P(Â_n = A) = 1.

The proof is very similar to that of Corollary 2.4 in Rejchel (2016); we include it here for completeness.

Proof. Denote L_{λ_n}(ζ) := L(ζ) + (λ_n/n)∥β∥_1. Note that if j ∈ A, then P(j ∉ Â_n) = P(β̂_n(j) = 0) → 0 according to Lemma E.5; thus P(A ⊆ Â_n) → 1. Next, we show that P(Â_n ⊆ A) → 1. Otherwise, there exists j ∈ Â_n with j ∉ A. Recall that ζ̂_n minimizes L_{λ_n}(ζ), so ∂L(ζ̂_n)/∂β_j + (λ_n/n) ∂|β̂_j| = 0, and since β̂_j ≠ 0, we have |(n/λ_n) ∂L(ζ̂_n)/∂β_j| = 1. Besides,

(n/λ_n) ∂L(ζ̂_n)/∂β_j = (n/λ_n) [ ∂L(ζ̂_n)/∂β_j − ∂L(ζ*)/∂β_j − (H(ζ̂_n − ζ*))_j ] + (n/λ_n) ∂L(ζ*)/∂β_j + (n/λ_n) (H(ζ̂_n − ζ*))_j.

According to Lemma E.6, the first term converges to 0 in probability; by square-integrability and the central limit theorem, the second term also converges to 0 in probability; and from Lemma E.5, the third term converges to (Hζ^0)_j in probability. Note that ζ^0 satisfies

H_{(α,A),(α,A)} (α^{0,⊤}, β^{0,⊤}_A)^⊤ = (0^⊤, (sign(β*_A))^⊤)^⊤,   H_{A^c,(α,A)} (α^{0,⊤}, β^{0,⊤}_A)^⊤ ∈ ∂∥β^0_{A^c}∥_1.

Therefore, we have |H_{A^c} ζ^0| < 1. Then, for λ_n ≥ 2n∥∇L(ζ*)∥_∞, we have the ℓ2-consistency bound (Theorem E.8)

∥ζ̂_n − ζ*∥_2^2 = O( (λ_n^2 / (n^2 γ^2)) (|A| + dim(α)) ).

Proof. According to Theorem 1 in Negahban et al. (2012), we have ∥ζ̂_n − ζ*∥_2^2 = O((λ_n^2/(n^2 γ^2)) Ψ^2) if λ_n ≥ 2n R*(∇L(ζ*)) for some regularization function R, with R* denoting the conjugate function of R and Ψ := sup_{β ≠ 0} R(β)/∥β∥. In our setting, R(β) := ∥β∥_1.
Therefore, we have Ψ ≤ √(|A| + dim(α)). The proof is completed by noting that R* = ∥·∥_∞. Remark E.9. By square-integrability and the law of large numbers, we have ∇L(ζ*) →_{a.s.} 0 and thus ∥∇L(ζ*)∥_∞ →_{a.s.} 0. Therefore, as long as λ_n satisfies the conditions in Thm. E.7, it can also satisfy λ_n ≥ 2n∥∇L(ζ*)∥_∞. In this regard, both model selection consistency and ℓ2-consistency hold in the fixed-dimensional setting; in the high-dimensional setting, we have ℓ2-consistency, which is our ultimate goal, i.e., identifying the minimax optimal predictor.
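The Lasso step with a general loss can be sketched with proximal gradient descent (ISTA). This is a hedged illustration under the squared loss; the design matrix, noise level, penalty λ, and iteration count are made-up choices, not the paper's settings. On the toy problem, the recovered support matches A = supp(β*), illustrating model selection consistency.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 500, 10
X = rng.normal(size=(n, d))
beta_true = np.zeros(d)
beta_true[:3] = [2.0, -1.5, 1.0]                 # A = {0, 1, 2}
y = X @ beta_true + 0.1 * rng.normal(size=n)

lam = 0.05                                        # plays the role of lambda_n / n
step = 1.0 / np.linalg.eigvalsh(X.T @ X / n).max()
beta = np.zeros(d)
for _ in range(2000):
    grad = X.T @ (X @ beta - y) / n               # gradient of the smooth loss
    z = beta - step * grad
    beta = np.sign(z) * np.maximum(np.abs(z) - step * lam, 0.0)  # l1 prox

support = set(np.flatnonzero(np.abs(beta) > 1e-3))
```

With λ well above the noise scale of the gradient but below the signal magnitudes, the null coordinates are thresholded to exactly zero while the signal coordinates survive with a small shrinkage bias.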

E.2 LINEARIZED BREGMAN ITERATION

In this section, we introduce an alternative algorithm, Linearized Bregman Iteration (LBI), to replace the minimization step via Lasso. LBI was first proposed by Osher et al. (2005) for image denoising. In Osher et al. (2016); Huang & Yao (2018), the authors established LBI's statistical model selection consistency from the perspective of differential inclusions. Such consistency holds under nearly the same conditions as in the linear model, but under a general convex loss it additionally requires restricted strong convexity to hold for each solution on the path. More importantly, LBI is more efficient to implement than Lasso. Specifically, the condition on λ_n for model selection consistency and ℓ2-consistency is given in an asymptotic form; in practice, to select the optimal λ, Lasso has to set a sequence of hyperparameters and run an optimization algorithm for each of them. In contrast, LBI generates a whole regularization solution path, with each iteration corresponding to a solution on the path. Motivated by this property, we propose to replace the minimization step with LBI, which is composed of a gradient descent step followed by a soft-thresholding step. Combined with the gradient ascent step, the algorithm is shown as follows. Here, δ is the step size, z is a sub-gradient of Ω(β) := ∥β∥_1 + (1/(2κ))∥β∥_2^2, and κ > 0 denotes the damping factor, which trades off efficiency against statistical properties. Specifically, the Inverse Scale Space (ISS) dynamics, which returns unbiased solutions, is the limit of LBI as κ → ∞; however, a large κ leads to computational inefficiency, noting that δ and κ should satisfy δκ < 1/λ_max(∇²ℓ), where λ_max(A) denotes the maximal eigenvalue of A. Instead of running a full optimization algorithm in the minimization step as Lasso does, LBI spends only one gradient descent step and one soft-thresholding step per iteration, which is much more efficient to implement.
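The LBI update can be sketched on a toy sparse linear model. This is a hedged demo of the β-path only (the maximization step over θ is omitted), and all hyperparameters and the data-generating model are illustrative assumptions; note δκ is kept strictly below 1/λ_max as required.

```python
import numpy as np

rng = np.random.default_rng(2)
n, d = 1000, 8
X = rng.normal(size=(n, d))
beta_true = np.zeros(d)
beta_true[0], beta_true[3] = 3.0, -2.0
y = X @ beta_true + 0.1 * rng.normal(size=n)

kappa = 10.0                                            # damping factor
lam_max = np.linalg.eigvalsh(X.T @ X / n).max()
delta = 0.5 / (kappa * lam_max)                         # ensures delta * kappa < 1 / lambda_max
z = np.zeros(d)
path = []
for k in range(500):
    beta = kappa * np.sign(z) * np.maximum(np.abs(z) - 1.0, 0.0)  # soft-thresholding
    z = z - delta * X.T @ (X @ beta - y) / n                      # gradient step on z
    path.append(beta)

beta_final = path[-1]
support = set(np.flatnonzero(np.abs(beta_final) > 1e-2))
```

Each iterate along `path` is one point on the regularization path (larger k corresponds to weaker effective regularization), so early stopping plays the role of choosing λ in Lasso.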
However, a disadvantage of using LBI in place of Lasso lies in the lack of statistical consistency guarantees, as the LBI step alternates with the gradient ascent w.r.t. θ.

F APPENDIX FOR SEC. 4: EXPERIMENT

Implementation of Baselines. Vanilla uses E[Y|x] to predict Y and is implemented with the same neural network as f_{S^-} (introduced later). Other baselines are implemented with the authors' official code: ICP (https://github.com/juangamella/icp); IC (https://github.com/mrojascarulla/causal_transfer_learning); Anchor regression (https://github.com/rothenhaeusler/anchor-regression); IRM (https://github.com/facebookresearch/InvariantRiskMinimization); HRM (https://github.com/LJSthu/HRM); IB-IRM (https://github.com/ahujak/IB-IRM). As the Surgery Estimator does not provide official code, we implemented it following the settings of our method.

F.1 SIMULATION

Implementation Details. In all three settings, SGD is used for optimization. In Settings 1-2, the structural equation x_1 ← g_1(x_4, y) + u_1 is estimated by a one-layer fully-connected neural network (FC), with 1000 training iterations and learning rate 0.01. In Setting 3, the equation is estimated by a two-layer FC with a sigmoid activation function in the hidden layer, with 1000 training iterations and learning rate 0.01. In Setting 1, f_{S^-} is parameterized by a one-layer FC, trained for 2000 iterations with learning rate 0.001; J_θ is parameterized by the same structure, trained for 2000 iterations with learning rate 0.05. In Setting 2, f_{S^-} is parameterized by a one-layer FC, trained for 1000 iterations with learning rate 0.001; J_θ is parameterized by the same structure, trained for 5000 iterations with learning rate 0.05. In Setting 3, f_{S^-} is parameterized by a two-layer FC with a sigmoid activation function in the hidden layer, trained for 5000 iterations with learning rate 0.01.
J_θ is parameterized by the same structure, trained for 2000 iterations with learning rate 0.01. The code is implemented with PyTorch 1.10 and run on a server with an Intel Xeon E5-2699A v4 @ 2.40GHz CPU.

Additional Results on Causal Discovery. We randomly generate DAGs according to the Erdős–Rényi model Erdős et al. (1960). We consider three low-dimensional settings with node numbers {6, 8, 10} and a high-dimensional setting with 100 nodes. For the low-dimensional settings, we generate 10 domains, where the number of mutable variables is set to {2, 3} and the sample size n_e is set to 200 per domain. For the high-dimensional setting, the generated graphs are sparse; we generate 20 domains, where the number of mutable variables is set to 20 and the sample size n_e is set to 500 per domain. For the low-dimensional settings, we implement the PC algorithm Spirtes et al. (2000) to learn the undirected skeletons; for the high-dimensional setting, PC-stable Colombo et al. (2014) is used. Our algorithm is then used to determine the local components. To remove the effect of randomness, we repeat 40 times. We report the F1 score, precision, and recall in Tab. 2. As we can see, even when the causal graphs are more complicated, our discovery algorithm still gives accurate results, which further validates its effectiveness and stability.

Comparison with Baselines. We report the maximal mean squared error (max MSE) over the test sets for our method and the baselines in Tab. 3. Besides, in Tab. 4 we report the standard deviation of the MSE (std. of MSE) over the test sets as a measure of transfer stability. As we can see, both the maximum and the standard deviation of the MSE of our method are low; for example, the max MSE is 0.0075 and the std. of MSE is 0.0006 in Setting 2. This verifies that our method is both robust and stably transferable under distributional shifts.
Besides, our method has a large improvement over the baselines in the highly non-linear Setting 3.



Note that the edge between Y and X ∈ K can only be Y → X. Z(W) is the set of Z-nodes that are not ancestors of any W-node in G_X. The intervened graph is the graph after removing all edges into X_M. The stable graph is the graph after removing all vertices in X_M. This can be achieved because X_M, De(X_M), and their parents are identifiable, as shown in Alg. 5.



Figure 2: Illustration of Thm. 3.1.

to replace P^e(X_M | PA(X_M)) with P(X_M) and define P'(X, Y) := P(Y | PA(Y)) ∏_{i∈S} P(X_i | PA(X_i)) P(X_M). Then we have f_{S^-}(x) = E_{P'}[Y | x_{S^-}, x_M]. Here we set p(x_M) := Σ_{e ∈ E_Tr} ( p_e / Σ_{e' ∈ E_Tr} p_{e'} ) p^e(x_M), with p_e ≈ n_e/n.

Figure 3: Examples of G X M : (a) chain (b) skip-chain (c) knot.

Figure 4: DAG with X M := {X 4 }. Dotted arrow exists in setting-2,3.

4.2 REAL-WORLD APPLICATIONS

Datasets. We consider the Alzheimer's Disease Neuroimaging Initiative (ADNI) Petersen et al. (2010) dataset for Alzheimer's Disease diagnosis, and the International Mouse Phenotyping Consortium (IMPC) CRM workshop (2016) dataset for gene function prediction.

Figure 5: Learned causal graphs on (a) ADNI and (b) IMPC. ↔ denotes undirected edges.

Figure 6: Results on IMPC. (a) Maximal MSE over test environments. (b) Maximal MSE of predictors that are ranked in ascending order from left to right, respectively according to h * of our method and the validation's loss of the graph surgery estimator.

Figure 7: Max MSE of predictors in the same equivalent class on (a) ADNI and (b) IMPC datasets.

E[Y · E[Y | X_s, do(X_m)]] = Σ_{x_s, x_m} Σ_y p(x_s|x_m, y) p(x_m|y) p(y) · y · E[Y | x_s, do(x_m)].   (9)

Since we have p(y | x_s, do(x_m)) = p(y) p(x_s|x_m, y) / Σ_y p(y) p(x_s|x_m, y), we have

E[Y | x_s, do(x_m)] = p(y=1) p(x_s|x_m, y=1) / Σ_y p(y) p(x_s|x_m, y).   (10)

Substituting Eq. (10) into Eqs. (8) and (9), we have

E[ E²[Y | X_s, do(X_m)] ] = Σ_{x_s, x_m} Σ_y p(x_s|x_m, y) p(x_m|y) p(y) · ( p(y=1) p(x_s|x_m, y=1) / Σ_y p(y) p(x_s|x_m, y) )²,

E[ Y · E[Y | X_s, do(X_m)] ] = Σ_{x_s, x_m} Σ_y p(x_s|x_m, y) p(x_m|y) p(y) · y · p(y=1) p(x_s|x_m, y=1) / Σ_y p(y) p(x_s|x_m, y)
= Σ_{x_s, x_m} p(x_s|x_m, y=1) p(x_m|y=1) p(y=1) · p(y=1) p(x_s|x_m, y=1) / Σ_y p(y) p(x_s|x_m, y).
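The normalization in Eq. (10) can be checked numerically. The discrete probability tables below are made-up assumptions for the demo; the functions implement the truncated-factorization formula for p(y | x_s, do(x_m)) and, for binary Y, E[Y | x_s, do(x_m)] = p(y=1 | x_s, do(x_m)).

```python
import itertools

# Made-up distributions (assumptions): p(y) and p(x_s | x_m, y).
p_y = {0: 0.3, 1: 0.7}
p_xs = {  # keyed by (x_m, y)
    (0, 0): {0: 0.9, 1: 0.1},
    (0, 1): {0: 0.4, 1: 0.6},
    (1, 0): {0: 0.2, 1: 0.8},
    (1, 1): {0: 0.5, 1: 0.5},
}

def p_y_given_do(y, x_s, x_m):
    # p(y | x_s, do(x_m)) = p(y) p(x_s|x_m,y) / sum_y' p(y') p(x_s|x_m,y').
    num = p_y[y] * p_xs[(x_m, y)][x_s]
    den = sum(p_y[yy] * p_xs[(x_m, yy)][x_s] for yy in p_y)
    return num / den

def e_y_given_do(x_s, x_m):
    # For binary Y, the interventional expectation is the y=1 probability.
    return p_y_given_do(1, x_s, x_m)

for x_s, x_m in itertools.product([0, 1], repeat=2):
    total = sum(p_y_given_do(y, x_s, x_m) for y in p_y)
    assert abs(total - 1.0) < 1e-12   # proper normalization over y
```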

PROOF FOR PROP. 3.4: IDENTIFIABILITY IN THM. 3.3

Proposition 3.4. Denote P_J := p(y, x_S | do(X_M = J(pa(x_M)))) and f_{S^-} := E[Y | x_{S^-}, do(x_M)].

as shown by Fig. 9 (b).

as shown by Fig. 9 (c).

Figure 10

Figure 11: Recovering g-equivalence in knot graph.

Figure 12: Recovering g-equivalence in lollipop.

Figure 13: Recovering g-equivalence in 2-lollipop.

Figure 14: Recovering g-equivalence in skip-chain with m = 1.

Example 8 (Skip Chain). A skip chain graph is constructed by adding skip connections between Y and m covariates in a chain. For any m-skip chain with n covariates, N_{G_{m,n}} = O(n^{2m}).

Figure 15: Recovering g-equivalence in skip chain with m = 2.

tree has degree at least three (one parent vertex in the tree and at least two child vertices in the tree). By Lemma D.2, (d_T + 1)/2 ≤ d_L ≤ d_T, which indicates d_T = Θ(d_L), and thus the complexity of Alg. 6 is Θ(N_G).

E APPENDIX FOR SEC. 3.4: SPARSE MIN-MAX OPTIMIZATION

In this section, we provide a theoretical analysis of the following empirical min-max optimization problem and, in addition, a more efficient algorithm called Linearized Bregman Iteration (LBI).


Lemma E.5 (Corollary 2.3 in Rejchel (2016)). Under Assumptions E.1 and E.2, and with λ_n set such that lim_n λ_n/n = 0 and lim_n λ_n/√n = ∞, we have (n/λ_n)(ζ̂_n − ζ*) →_p ζ^0 := argmin_ζ V(ζ), with

Maximization step: θ^{k+1} = θ^k + δ ∇_θ ℓ(β^k, α^k, θ^k).   (gradient ascent w.r.t. θ)
Linearized Bregman Iteration:
α^{k+1} = α^k − κδ ∇_α ℓ(β^k, α^k, θ^k),   (gradient descent w.r.t. α)
z^{k+1} = z^k − δ ∇_β ℓ(β^k, α^k, θ^k),   (gradient descent w.r.t. z)
β^{k+1} = κ · sign(z^{k+1}) · max(0, |z^{k+1}| − 1).   (soft-thresholding to obtain β)

Mean Squared Error (MSE) on simulation data.

REPRODUCIBILITY STATEMENT

Data, code, and instructions to reproduce the main experimental results are provided. Specifically, the ADNI dataset is available at http://adni.loni.ucla.edu, the IMPC dataset is available at http://www.crm.umontreal.ca/2016/Genetics16/competition_e.php, the code is provided in the supplementary materials, and the implementation instructions are provided in Sect. F of the appendix.

Causal learning for domain generalization. There have been emerging works that consider the domain generalization problem from a causal perspective. One line of work Arjovsky et al. (2019); Xie et al. (2020); Müller et al. (2020) promoted invariance as a key surrogate feature of causation, where the causal graph serves more as a motivation. Another line of work Ilse et al. (2020); Lu et al. (2021); Mahajan et al. (2020); Mitrovic et al. (2021) considered domain generalization for unstructured data using specifically designed causal graphs to incorporate priors on the distribution shift, in which the causal features are modeled as latent variables to be inferred for robust prediction. The works most relevant to ours pursue robust optimization by making invariance assumptions about causal mechanisms Subbaswamy et al. (2019); Bühlmann (2020); Peters et al. (2016); Subbaswamy & Saria (2020). Specifically, Peters et al. (2016) assumed the generation of Y from its parents to be invariant, and hence only utilized Y's parents for transfer. Subbaswamy et al. (2019); Subbaswamy & Saria (2020) considered a selection diagram framework, in which mutable variables are children of the selection variable; they then remove the unstable mechanism by intervening on X_M and obtain a set of stable covariates S.

Our work shares a similar framework with Huang et al. (2020) in formulating the distribution shift. However, while they focused on recovering the full causal graph to study the relations among variables, we provide a local discovery procedure, which aids the analysis of min-max properties and the identification of robust predictors.

E_{P̄^e}[ Var_{P̄^e}(Y | X) ] = E_{P̄^e}[ Var_{P̄^e}(Y | K_2) ] ≥ E_{P^e}[ Var_{P^e}(Y | X) ]. (iv) In summary, for each P^e ∈ P, we may construct P̄^e such that E_{P̄^e}[ Var_{P̄^e}(Y | X) ] ≥ E_{P^e}[ Var_{P^e}(Y | X) ]. Denote P̄ := { P̄^e | P^e ∈ P } and P* := argmax_{P ∈ P̄} E_P[ Var_P(Y | X) ]; then P* ∈ P̄. Besides, note that for any P^e ∈ P̄, E_{P^e}

Algorithm 4 Detection of De(X_M) ∪ X_M

x_1, ..., x_n ∼ i.i.d. p(x, y | x_M = J_θ(pa(x_M))). For simplicity, we use p to denote p(x, y | x_M = J_θ(pa(x_M))) in the rest of this paper.

since the irrepresentable condition holds. In this regard, we obtain a contradiction with the fact that j ∈ Â_n. Next, we show that in both the fixed and high-dimensional settings, we have the following ℓ2-consistency, which is a natural conclusion of applying the results on M-estimators in Negahban et al. (2012): Theorem E.8. Under Assumption E.1, suppose λ_n

Performance of Causal Discovery.

Maximal MSE comparison on simulation data (one row per setting; columns are the compared methods).
setting-1: 1.90 ±.58  2.17 ±1.20  1.68 ±.54  1.38 ±.10  1.34 ±.23  2.69 ±1.74  1.58 ±.91  1.18 ±.06  1.18 ±.06
setting-2: .07 ±.00  .17 ±.31  .06 ±.02  .06 ±.04  .0071 ±.00  .33 ±.77  .29 ±.81  .0075 ±.0006  .0075 ±.00
setting-3: 1.72 ±.72  1.61 ±.71  1.54 ±.62  2.98 ±1.07  2.34 ±.65  1.75 ±1.42  1.71 ±.41  1.10 ±.05  1.10 ±.05

Mean (over randomization) of std. (over test domains) of MSE on simulation data.

Mean (over randomization) of std. (over test domains) of MSE on ADNI dataset.

Mean (over randomization) of std. (over test domains) of MSE on the IMPC gene dataset.

Brain regions partition.


Proof. We first introduce some notions used in the proof. Denote by [X_i] := {X_j | X_j ∼_G X_i} the equivalence class with representative element X_i, and denote the set of all equivalence classes by Pow(X)/∼_G. Define the length of a path as the number of edges in it. In a causal graph G, we say X_i is a w-order neighbour of Y if the shortest path between Y and X_i has length w; as a special case, X_i is called a 0-order neighbour of Y if there is no path between Y and X_i. Let Ω(G) = 0 if Y does not have any neighbour, and let Ω(G) = 1, 2, 3, ... if Y has a 1, 2, 3, ...-order neighbour, respectively. Note that if we construct a MAG G' over O by G' = MAG(G, O, L, S), then G' preserves the separation relations among any vertex sets in O; the proof is available in Sect. 2.3 of Zhang (2008). In the following, we prove the proposition by induction on Ω(G).

Base. For any causal graph G with Ω(G) = 0, we have Neig(Y) = ∅. So, for any X_i, X_j ⊆ X, we have ∅ ⊆ X_i ∩ X_j such that Y ⊥_G X_i ∪ X_j \ ∅ | ∅, which means X_i ∼_G X_j. This means Pow(X)/∼_G = {[X]}, i.e., all subsets of the covariate set are equivalent and there is only one equivalence class.

Induction hypothesis. Assume that for any causal graph G_{≤w} with Ω(G_{≤w}) ≤ w, Pow(X)/∼_{G_{≤w}} = Recover(G_{≤w}).

Step. In the following, we show that for any causal graph G_{w+1} with Ω(G_{w+1}) = w + 1, we have Pow(X)/∼_{G_{w+1}} = Recover(G_{w+1}).

Denote the covariates in Neig(Y), and decompose each X_i ∈ Pow(X) as X_i = X_i^1 ∪ X_i^other, where X_i^1 contains all 1-order covariates and X_i^other contains the others. So, Pow(X) can be partitioned accordingly. By the aforementioned property of MAGs (Zhang, 2008), if we construct the corresponding MAG, the above partition can be further rewritten, and by the design of Alg. 6 (line 18), we eventually have Pow(X)/∼_{G_{w+1}} = Recover(G_{w+1}).

Corollary C.8 (g-equivalent classes in sub-graphs). Denote the causal graph G and the covariate set X. Let Z ⊆ X be a subset of covariates. Denote the covariates in Z as {X_{z_1}, X_{z_2}, ..., X_{z_l}}, and the power set of Z accordingly. Denote the numbers of g-equivalent classes in the causal graph G and in the sub-graph over Z, respectively.

Proof. Because we do not restrict the set Z to Neig(Y), an element X_i from R_i and an element X_j from R_j may be graphically equivalent. So, the equality in Prop. C.7 becomes a greater-than-or-equal-to relation.

Indeed, the true causal DAG with complete orientation is not identifiable; what we can identify is a partially directed acyclic graph (PDAG), representing all Markov equivalent graphs of the true DAG.

The improvement of our method is most significant in the highly non-linear setting-3. As for the slight improvements over the baseline in the other settings, this may be due to the simulation settings being simple enough for the vanilla method to exploit only X_2 for prediction.
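As a toy illustration of the counting argument, under the simplifying assumption (valid in the base case above) that two subsets are equivalent whenever they have the same intersection with Neig(Y), the number of classes can be enumerated directly; the general recursion is what Alg. 6 implements:

```python
from itertools import chain, combinations

def powerset(xs):
    """All subsets of xs, as frozensets."""
    xs = list(xs)
    return [frozenset(c) for c in chain.from_iterable(
        combinations(xs, r) for r in range(len(xs) + 1))]

def count_classes(covariates, neig_y):
    """Count classes when subsets are keyed by their intersection with
    Neig(Y); a simplification of the general graph-based recursion."""
    key = frozenset(neig_y)
    return len({s & key for s in powerset(covariates)})

base = count_classes(["X1", "X2", "X3"], [])      # Neig(Y) empty
one = count_classes(["X1", "X2", "X3"], ["X1"])   # one 1-order neighbour
```

With Neig(Y) = ∅ every subset falls into a single class, matching the base case of the proof; each added neighbour doubles the number of distinguishable keys.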

F.2 ALZHEIMER'S DISEASE DIAGNOSIS

Implementation Details. The imaging data are acquired from structural Magnetic Resonance Imaging (sMRI) scans. After data preprocessing via Dartel VBM (Ashburner, 2007) and Statistical Parametric Mapping (SPM) for segmentation, we partition the whole brain into 9 brain regions according to Tab. 8 and Tab. 7. Data normalization (w.r.t. mean and standard deviation) is used.

All structural equations are estimated by a two-layer fully-connected network (FC) with a sigmoid activation function in the hidden layer. For the structural equations generating X_2, X_3, the training takes 5000 iterations with the learning rate set to 0.1. For those generating X_4, X_5, X_6, X_7, the training takes 2000 iterations with the learning rate set to 0.1. f_{S^-} is parameterized by a two-layer FC with a sigmoid activation function in the hidden layer, with training iterations set to 5000 and the learning rate set to 0.25 (decreased to 0.1 at iteration 4000). J_θ is parameterized by the same structure, with training iterations set to 2000 and the learning rate set to 0.25. SGD is used for optimization. For the sparsity-based optimization, we set the training iterations to 350, with the learning rate set to 0.05 and the penalty weight set to 2. Adam is used for optimization.

We pick four domains with more than 40 patients as the training domains and test on the remaining three domains. To remove the effect of randomness, we replicate over all 15 possible train-test splits.

Additional Results. We first report the std. of MSE over the test sets for our method and the baselines in Tab. 5. As we can see, our method outperforms the other baselines by a significant margin. This result demonstrates the utility of our method in learning stably transferable predictors. Then, we compare the performance of all g-equivalent classes with more than one member in Fig. 18. As we can see, most equivalent classes have similar performance (small deviations).
As for the several classes with large deviations, this may be due to the approximation error incurred when inferring the causal graph. Next, we show the optimization curve of h(S^-, J_θ) and the max MSE for 100 randomly picked subsets S^- ⊂ S in Fig. 16 and Fig. 17. As we can see, the optimization over J_θ converges well, and the performance of different subsets is consistent with our expectations. This observation again suggests the utility of Thm. 3.5 in finding the optimal predictor. Finally, we show the loss curve of the sparsity-based optimization in Fig. 20. As we can see, the optimization over h* converges well.

F.3 GENE FUNCTION PREDICTION

Implementation Details. We use the wild-type mice and three kinds of gene knockouts in the training domains. To remove the effect of randomness, we generate 45 replications, with each trial appending 2 out of the remaining 10 gene knockouts to the training domains and testing on the remaining 8 gene knockouts.

Additional Results. We first report the std. of MSE over the test sets for our method and the baselines in Tab. 6. Similarly, our method outperforms the other baselines by a significant margin, which, together with the Alzheimer's disease experiment, shows the utility of our method in learning stably transferable predictors. Then, we show the optimization curve of h(S^-, J_θ) and the loss curve of the sparsity-based optimization in Fig. 19 and Fig. 21, respectively. As we can see, both optimizations converge well.
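The min-max optimization over the predictor and the mutable mechanism J_θ discussed above can be illustrated with simultaneous gradient descent-ascent on a toy strongly-convex-strongly-concave objective f(θ, η) = θ² + θη − η², whose unique saddle point is θ = η = 0. This is a sketch of the optimization pattern only, not the paper's optimizer:

```python
def gda(theta=1.0, eta=-0.5, steps=3000, lr=0.05):
    """Simultaneous gradient descent on theta and ascent on eta for
    f(theta, eta) = theta**2 + theta*eta - eta**2."""
    for _ in range(steps):
        g_theta = 2 * theta + eta          # df/dtheta (minimized)
        g_eta = theta - 2 * eta            # df/deta (maximized)
        theta, eta = theta - lr * g_theta, eta + lr * g_eta
    return theta, eta

theta, eta = gda()                         # both converge toward 0
```

Because the objective is strongly convex in θ and strongly concave in η, the simultaneous updates contract toward the saddle point for small enough step sizes, mirroring the observed convergence of the h(S^-, J_θ) curves.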

