LATENT CAUSAL INVARIANT MODEL

Abstract

Current supervised learning can learn spurious correlations during data fitting, which raises concerns about interpretability, out-of-distribution (OOD) generalization, and robustness. To avoid such spurious correlations, we propose a Latent Causal Invariance Model (LaCIM) that pursues causal prediction. Specifically, we model the underlying causal factors by introducing latent variables that are separated into (a) output-causative factors and (b) others that are spuriously correlated with the output via confounders. We further assume that the generating mechanisms from the latent space to the observed data are causally invariant. We give an identifiability claim for this invariance, in particular the disentanglement of output-causative factors from others, as a theoretical guarantee for precise inference and for avoiding spurious correlation. We propose a Variational-Bayesian-based method for estimation, and optimize over the latent space for prediction. The utility of our approach is verified by improved interpretability, predictive power in various OOD scenarios (including healthcare), and robustness to small perturbations.

1. INTRODUCTION

Current data-driven deep learning models, though revolutionary in various tasks, rely heavily on i.i.d. data and exploit all types of correlations to fit the data well. Among such correlations, there can be spurious ones corresponding to biases (e.g., selection or confounding bias due to the coincident presence of a third factor) inherited from the data provided. Such data-dependent spurious correlations can erode (i) the interpretability of decision-making, (ii) the ability of out-of-distribution (OOD) generalization, i.e., extrapolation from observed to new environments, which is crucial especially in safety-critical tasks such as healthcare, and (iii) the robustness to small perturbations (Goodfellow et al., 2014). Recently, there has been a renaissance of causality in machine learning, which is expected to pursue causal prediction (Schölkopf, 2019). The notion of "causality" was pioneered by Judea Pearl (Pearl, 2009) as a mathematical formulation of this metaphysical concept grasped in the human mind. Incorporating a priori knowledge about cause and effect endows the model with the ability to identify the causal structure, which entails not only the data but also the underlying process of how they are generated. For causal prediction, the old-school methods (Peters et al., 2016; Bühlmann, 2018) causally related the output label Y to the observed input X, which however is NOT conceptually reasonable in scenarios with sensory-level observed data (e.g., modeling pixels as causal factors of Y does not make much sense). For such applications, we rather adopt the manner of Bengio et al. (2013); Biederman (1987) and relate the causal factors of Y to unobserved abstractions denoted by S, i.e., Y ← f_y(S, ε_y) via mechanism f_y. We further assume the existence of additional latent components denoted by Z, which together with S generate the input X via mechanism f_x as X ← f_x(S, Z, ε_x).
Taking image classification as an example, S and Z respectively refer to object-related abstractions (e.g., contour, texture, color) and contextual information (e.g., light, view). Such an assumption is similarly adopted in the literature on nonlinear Independent Component Analysis (ICA) (Hyvarinen and Morioka, 2016; Hyvärinen et al., 2019; Khemakhem, Kingma and Hyvärinen, 2020; Teshima et al., 2020) and latent generative models (Suter et al., 2019), which, however, do not separate the output (y)-causative factors (a.k.a. S) from other correlating factors (a.k.a. Z), both of which can be learned in the data-fitting process. We encapsulate these assumptions into a novel causal model, namely the Latent Causal Invariance Model (LaCIM) illustrated in Fig. 1, in which we assume the structural equations f_x (associated with S, Z → X) and f_y (associated with S → Y) to be the Causal Invariant Mechanisms (CIMe) that hold under any circumstances, with P(S, Z) allowed to vary across domains. The incorporation of these priors can explain the spurious correlation embedded in the back-door path from Z to Y (contextual information to the class label in image classification). To avoid learning spurious correlations, our goal is to identify the intrinsic CIMe f_x, f_y. Specifically, we first prove the identifiability (i.e., the possibility to be precisely inferred up to an equivalence relation) of the CIMe. Notably, far beyond the scope of the existing literature (Khemakhem, Kingma and Hyvärinen, 2020), our results can implicitly, and are the first to, disentangle the output-causative factors (a.k.a. S) from others (a.k.a. Z) for prediction, to ensure the isolation of undesired spurious correlation. Guaranteed by this result, we propose to estimate the CIMe by extending the Variational Auto-encoder (VAE) (Kingma and Welling, 2014) to the supervised scenario. For OOD prediction, we propose to optimize over the latent space under the identified CIMe.
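To make the back-door-induced spurious correlation concrete, the toy simulation below (our own illustrative sketch; all functional forms and coefficients are assumptions, not the paper's) instantiates Y ← f_y(S, ε_y) and X ← f_x(S, Z, ε_x) with a confounder C that correlates S and Z differently per environment. The Z-Y correlation then flips sign across environments, while f_x and f_y stay fixed:

```python
import numpy as np

def sample_domain(n, gamma, rng):
    """Sample one environment of a toy LaCIM-style generating process.

    gamma controls the domain-specific prior P^e(S, Z) through the confounder C;
    it varies across environments, while f_x and f_y below are shared (the
    causal invariant mechanisms). Functional forms are illustrative assumptions.
    """
    c = rng.normal(size=n)                     # confounder C
    s = c + 0.5 * rng.normal(size=n)           # y-causative factor S <- f_s(C, eps_s)
    z = gamma * c + 0.5 * rng.normal(size=n)   # y-non-causative factor Z <- f_z(C, eps_z)
    x = np.stack([s + z, s - z], 1) + 0.1 * rng.normal(size=(n, 2))  # X <- f_x(S, Z, eps_x)
    y = 2.0 * s + 0.1 * rng.normal(size=n)     # Y <- f_y(S, eps_y), invariant
    return x, z, y

rng = np.random.default_rng(0)
_, z1, y1 = sample_domain(10_000, gamma=1.5, rng=rng)   # environment e1
_, z2, y2 = sample_domain(10_000, gamma=-1.5, rng=rng)  # environment e2
corr1 = np.corrcoef(z1, y1)[0, 1]  # spurious Z-Y correlation, positive in e1
corr2 = np.corrcoef(z2, y2)[0, 1]  # flips sign in e2: correlation, not causation
```

A predictor that exploits Z would transfer the e1 correlation to e2 and fail there, while one using only S remains accurate in both environments.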
To verify the correctness of our identifiability claim, we conduct a simulation experiment. We further demonstrate the utility of our LaCIM via highly interpretable learned semantic features, improved predictive power in various OOD scenarios (including tasks with confounding and selection bias, and healthcare), and robustness against attack. We summarize our contributions as follows: (i) Methodologically, we propose in section 4.1 a latent causal model in which only a subset of latent components are causally related to the output, to avoid spurious correlation and benefit OOD generalization; (ii) Theoretically, we prove the identifiability (in theorem 4.3) of the CIMe f_x, f_y from latent variables to observed data, which disentangles output-causative factors from others; (iii) Algorithmically, guided by the identifiability, we reformulate in section 4.3 the Variational Bayesian method to estimate the CIMe during training and optimize over the latent space at test time; (iv) Experimentally, LaCIM outperforms others in terms of predictive power on OOD tasks and interpretability in section 5.2, and robustness to small perturbations in section 5.3.

2. RELATED WORK

The invariance/causal learning literature proposes to learn an assumed invariance for transferring. For the invariance learning methods in Krueger et al. (2020) and Schölkopf (2019), the "invariance" can refer to a stable correlation rather than causation, which lacks interpretability and impedes generalization to a broader set of domains. For causal learning, Peters et al. (2016); Bühlmann (2018); Kuang et al. (2018); Heinze-Deml and Meinshausen (2017) assume the causal factors to be the observed input, which is inappropriate for sensory-level observational data. In contrast, our LaCIM introduces latent components as causal factors of the input; more importantly, we explicitly separate them into the output-causative features and others, to avoid spurious correlation. Further, we provide an identifiability claim for the causal invariant mechanisms. In independent and concurrent works, Teshima et al. (2020) and Ilse et al. (2020) also explore latent variables in causal relations. In comparison, Teshima et al. (2020) did not differentiate S from Z; and Ilse et al. (2020) proposed to augment intervened data, which can be intractable in real cases. Other works conceptually related to ours, as a non-exhaustive review, include (i) transfer learning, which also leverages invariance in the context of domain adaptation (Schölkopf et al., 2011; Zhang et al., 2013; Gong et al., 2016) or domain generalization (Li et al., 2018; Shankar et al., 2018); (ii) causal inference (Pearl, 2009; Peters et al., 2017), which proposes a structural causal model to incorporate intervention via "do-calculus" for cause-effect reasoning and counterfactual learning; and (iii) latent generative models, which also assume generation from a latent space to observed data (Kingma and Welling, 2014; Suter et al., 2019) but aim at learning the generator in the unsupervised scenario.

3. PRELIMINARIES

Problem Setup & Notation. Let $X, Y$ respectively denote the input and output variables. The training data $\{\mathcal{D}^e\}_{e \in \mathcal{E}_{train}}$ are collected from a set of multiple environments $\mathcal{E}_{train}$, where each environment $e$ is associated with a distribution $P^e(X, Y)$ over $\mathcal{X} \times \mathcal{Y}$ and $\mathcal{D}^e = \{x^e_i, y^e_i, d^e\}_{i \in [n_e]} \overset{i.i.d.}{\sim} P^e$, with $[k] := \{1, \dots, k\}$ for any $k \in \mathbb{Z}^+$. The $d^e \in \{0,1\}^m$ denotes the one-hot encoded domain index of $e$, where $1 \le m := |\mathcal{E}_{train}| \le n := \sum_{e \in \mathcal{E}_{train}} n_e$. Our goal is to learn a model $f: \mathcal{X} \to \mathcal{Y}$ that learns the output (y)-causative factors for prediction and performs well on the set of all environments $\mathcal{E} \supset \mathcal{E}_{train}$, which is aligned with existing OOD generalization works (Arjovsky et al., 2019; Krueger et al., 2020). We use upper-case, lower-case and calligraphic letters to respectively denote a random variable, an instance and the corresponding space, e.g., $a$ is an instance in the space $\mathcal{A}$ of random variable $A$. The $[f]_A$ denotes $f$ restricted to the dimensions of $A$. The Sobolev space $W^{k,p}(\mathcal{A})$ contains all $f$ such that $\int_{\mathcal{A}} \|\partial^{\alpha}_A f|_{A=a}\|^p \, da < \infty$ for all $|\alpha| \le k$.
Structural Causal Model. A structural causal model (SCM) is defined as a causal graph equipped with structural equations. The causal graph encodes assumptions via missing arrows in a directed acyclic graph (DAG): $G = (V, E)$, with $V, E$ respectively denoting the node set and the edge set. The $Pa(k)$ denotes the set of parent nodes of $V_k$ for each $V_k \in V$, and $X \to Y \in E$ indicates the causal effect of $X$ on $Y$. The structural equations $\{V_k \leftarrow f_k(Pa(k), \varepsilon_k)\}_{V_k \in V}$ quantify the causal effects encoded in the causal graph $G$. Assuming independence among the exogenous variables $\{\varepsilon_k\}_k$, the Causal Markov Condition states that $P(\{V_k = v_k\}_{V_k \in V}) = \prod_k P(V_k = v_k \,|\, Pa(k) = pa(k))$. A back-door path from $V_a$ to $V_b$ is defined as a path that ends with an arrow pointing to $V_a$ (Pearl, 2009).
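The Causal Markov factorization above can be checked numerically on a tiny example. The block below (a sketch on a toy binary SCM of our own, unrelated to the paper's graph) builds the joint over a chain A → B → C from its structural conditionals and verifies that the entailed joint sums to one and satisfies the implied conditional independence C ⫫ A | B:

```python
import itertools

# Toy discrete SCM over binary V = {A, B, C} with causal graph A -> B -> C.
# The conditional probability tables play the role of P(V_k | Pa(k)).
p_a = {0: 0.6, 1: 0.4}
p_b_given_a = {0: {0: 0.7, 1: 0.3}, 1: {0: 0.2, 1: 0.8}}
p_c_given_b = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.5, 1: 0.5}}

# Causal Markov Condition: the joint is the product of each node given parents.
joint = {}
for a, b, c in itertools.product([0, 1], repeat=3):
    joint[(a, b, c)] = p_a[a] * p_b_given_a[a][b] * p_c_given_b[b][c]

total = sum(joint.values())  # a valid joint must sum to 1

# Implied conditional independence: P(c | a, b) should equal P(c | b).
p_ab = {(a, b): sum(joint[(a, b, c)] for c in (0, 1)) for a in (0, 1) for b in (0, 1)}
ci_ok = all(
    abs(joint[(a, b, c)] / p_ab[(a, b)] - p_c_given_b[b][c]) < 1e-12
    for (a, b, c) in joint
)
```

The same factorization applied to the LaCIM graph of Fig. 1 yields the decomposition $p(x, y, s, z, c) = p(c)\,p(s|c)\,p(z|c)\,p(x|s,z)\,p(y|s)$ used throughout section 4.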

4. METHODOLOGY

We build our causal model associated with the Causal Invariant Mechanisms (CIMe, i.e., f_x, f_y) and a priori knowledge about the generating process in section 4.1, followed by our identifiability result for the CIMe in section 4.2. Finally, we introduce our learning method to estimate the CIMe in section 4.3. We introduce latent variables to model the abstractions/concepts that act as causal factors generating the observed variables (X, Y), which is more reasonable than assuming X to be the direct cause of Y in scenarios with sensory-level data. We explicitly separate the latent variables into two parts: S and Z, which respectively denote the y (output)-causative and y-non-causative factors, as shown by the arrow S → Y in Fig. 1. Besides, X and Y are respectively generated by (S, Z) and S, via structural equations (with noise) f_x, f_y, which are denoted as the Causal Invariant Mechanisms (CIMe) that hold across all domains. The output Y denotes the label generated by human knowledge, e.g., from the semantic shape or the contour used to discern the object. Hence, we regard Y as the outcome/effect of these high-level abstractions (Biederman, 1987) rather than their cause (a detailed comparison with Y → S is left to supplementary 7.7.1). We call the model associated with the causal graph in Fig. 1 the Latent Causal Invariance Model (LaCIM), with the formal definition given in Def. 4.1.

4.1. LATENT CAUSAL INVARIANCE MODEL

As an illustration, consider image classification, in which X, Y denote the image and the class label. Instead of X, i.e., the pixels, it is more reasonable to assume the causal factors (of X, Y) to be latent concepts (S, Z), which can denote light, angle, the shape of the object, etc., and generate X following physical mechanisms. Among these concepts, only the ones causally related to the object, i.e., S (e.g., shape), are causal factors of the object label Y. Following physical or natural law, the mechanisms S, Z → X and S → Y hold invariantly across domains. The $\mathcal{S} := \mathbb{R}^{q_s}$, $\mathcal{Z} := \mathbb{R}^{q_z}$ denote the spaces of S, Z, with $P^e(S, Z)$ (which characterizes the correlation between S and Z) varying across $\mathcal{E}$ (e.g., an object is more associated with a specific scene than with others). We assume that the y-non-causative factor (i.e., Z) is associated with (but not causally related to) S, Y through the confounder C, which is allowed to take a specific value for each sample unit. Therefore, the back-door path Z ← C → S → Y induces a correlation between Z and Y in each single domain. Rather than an invariant causation, this correlation is data-dependent and can vary across domains; it is known as "spurious correlation". In real applications, this spurious correlation corresponds to the bias inherited from the data, e.g., the contextual information in object classification. This domain-specific S-Z correlation can be explained by the source variable D, which takes a specific and fixed value for each domain and serves as the prior of the distribution of the confounder C, as illustrated in Fig. 1. This source variable D can refer to attributes/parameters that characterize the distribution of S, Z in each domain. When such attributes are unobserved, we use the domain index as a substitute. Consider the cat/dog classification task as an illustration: the animal in each image is associated with either snow or grass.
The S, Z respectively denote the concepts of animals and scenes. The D denotes the sampler, which can be described by the proportions of scenes associated with the cat and those associated with the dog. The D generates the C that denotes the (time, weather) at which one goes outside to collect samples. Since each sampler may have a fixed pattern (e.g., being used to going out in the sunny morning, or in the snowy evening), the data he/she collects may have sample selection bias (e.g., with dogs (cats) more associated with grass (snow) in the sunny morning (or snowy evening)). In this regard, the scene concepts Z can be correlated with the animal concepts S, and also with the label Y.
Definition 4.1 (LaCIM). The Latent Causal Invariance Model (LaCIM) for $e \in \mathcal{E}$ is defined as an SCM characterized by (i) the causal graph $G = (V, E)$ with $V = \{C, S, Z, X, Y\}$ and $E = \{C \to S, C \to Z, Z \to X, S \to X, S \to Y\}$; and (ii) structural equations with causal mechanisms $\{f_c, f_z, f_s, f_x, f_y\}$ embodying the quantitative causal information: $c \leftarrow f_c(d^e, \varepsilon_c)$, $z \leftarrow f_z(c, \varepsilon_z)$, $s \leftarrow f_s(c, \varepsilon_s)$, $x \leftarrow f_x(s, z, \varepsilon_x)$, $y \leftarrow f_y(s, \varepsilon_y)$, in which $\{\varepsilon_c, \varepsilon_z, \varepsilon_s, \varepsilon_x, \varepsilon_y\}$ are independent exogenous variables inducing $p_{f_c}(c|d^e)$, $p_{f_z}(z|c)$, $p_{f_s}(s|c)$, $p_{f_x}(x|s,z)$, $p_{f_y}(y|s)$. The CIMe $f_x, f_y$ are assumed to be invariant across $\mathcal{E}$. We call the environment-dependent parts $P^e(S, Z)$ and $P^e(S, Z|X)$ the S,Z-prior and the S,Z-inference in the following.
Remark 1. We denote LaCIM-$d_s$ and LaCIM-$d$ as two versions of LaCIM, according to whether the source variable $d_s$ with practical meaning (e.g., attributes or parameters of $P(S, Z)$) is observed or not. Observing $d_s$ is possible in some applications (e.g., age or gender characterizing the population in medical diagnosis). For LaCIM-$d$ with $d_s$ unobserved, we use the domain index $D$ as a substitute. Denote $\mathcal{C}$ as the space of $C$. We assume that $\mathcal{C}$ is a finite union of disjoint sets $\{\mathcal{C}_r\}_{r=1}^R$, i.e.,
$\mathcal{C} := \cup_{r=1}^R \mathcal{C}_r$, such that for any $c_{r,i} \neq c_{r,j} \in \mathcal{C}_r$, it holds that $p(s,z|c_{r,i}) = p(s,z|c_{r,j})$ for any $(s,z)$. Returning to the cat/dog classification example, the $\mathcal{C}$ denotes the range of time to collect samples, i.e., 00:00-24:00. The $\mathcal{C}$ can be divided into several time periods $\mathcal{C}_1, \dots, \mathcal{C}_R$, such that the proportion of (animal, scene) concepts given any $c$ in the same period is unchanged, e.g., the dog often comes up on the grass in the morning. Further, since $p(x,y|s,z) = p(x|s,z)\,p(y|s)$ is invariant, we have for each $\mathcal{C}_r$ that $p(x,y|c_{r,i}) = \int p(x,y|s,z)\,p(s,z|c_{r,i})\,ds\,dz = \int p(x,y|s,z)\,p(s,z|c_{r,j})\,ds\,dz = p(x,y|c_{r,j})$ for any $(x,y)$. That is, the $\{p(x,y|c_r)\}_{c_r \in \mathcal{C}_r}$ collapse, for each $(x,y)$, to a single point, denoted $p(x,y|c_r)$. In this regard, we have $p^e(x,y) := p(x,y|d^e) = \sum_{r=1}^R p(x,y|c_r)\,p(c_r|d^e)$. Besides, we assume the Additive Noise Model (ANM) for $X$, i.e., $f_x(s,z,\varepsilon_x) = \tilde{f}_x(s,z) + \varepsilon_x$ (we write $\tilde{f}_x$ as $f_x$ without loss of generality), which has been widely adopted to identify causal factors (Janzing et al., 2009; Peters et al., 2014; Khemakhem, Kingma and Hyvärinen, 2020). We aim to identify the CIMe (i.e., $f_x, f_y$), guaranteed by an identifiability result ensuring that the learning method can distinguish $S$ from $Z$ to avoid spurious correlation, as presented in section 4.2. Conventionally, identifiability means that the parameter giving rise to the observational distribution $p_\theta(x,y|d^e)$ can be uniquely determined, i.e., $p_\theta(x,y|d^e) = p_{\tilde\theta}(x,y|d^e) \implies \theta = \tilde\theta$. Instead of strict uniqueness, we rather identify an equivalence class of $\theta$ (Def. 4.2) that suffices to disentangle the y-causative factors $S$ from $Z$ to avoid learning spurious correlation. To achieve this goal, we first narrow our interest to the case when $p(s,z|c)$ is an exponential family, as in Eq.
(1), in which we can respectively identify $S$ and $Z$ up to linear and point-wise transformations, as given by theorem 4.3; we then generalize to any $p(s,z|c)$ belonging to a Sobolev space, as explained in theorem 4.4. A reformulated VAE is proposed to learn the CIMe in practice. For generalization, note that the gap between two environments in terms of prediction given $x$, i.e., $\mathbb{E}_{p^{e_2}}[Y|X=x] - \mathbb{E}_{p^{e_1}}[Y|X=x] = \int_{\mathcal{S}} \big(p^{e_2}(s|x) - p^{e_1}(s|x)\big)\,\mathbb{E}_{p_{f_y}}[y|s]\,ds$, is mainly due to the inconsistency of the S,Z-inference, i.e., $p^e(s,z|x) \neq p^{e'}(s,z|x)$ for $e \neq e'$ (for details please refer to theorem 7.1 in supplement 7.1). Therefore, one cannot directly apply the trained $\{p^e(s,z|x), p^e(y|x)\}_{e \in \mathcal{E}_{train}}$ to the inference model of a new environment, i.e., $p^{e'}(s,z|x), p^{e'}(y|x)$ for $e' \notin \mathcal{E}_{train}$. To solve this problem and generalize to new environments, we note that since $p_{f_x}(x|s,z)$ and $p_{f_y}(y|s)$ are shared among all environments, we propose to infer the $s, z$ that give rise to the test sample $x$ by maximizing the identified $p_{f_x}(x|s,z)$, as a pseudo-likelihood of $x$ given $(s,z)$, rather than using the S,Z-inference model, which is inconsistent across environments. Then, we feed the estimated $s$ into the invariant predictor $p_{f_y}(y|s)$ for prediction.
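The test-time procedure just described can be sketched as follows. This is our own minimal illustration, not the paper's implementation: the decoder is linear, $x = W[s; z]$, so under Gaussian noise, maximizing $\log p_{f_x}(x|s,z)$ reduces to least-squares descent (the paper instead uses a trained neural decoder with Adam; see section 4.3):

```python
import numpy as np

def infer_and_predict(x, W, predictor, q_s=4, n_init=64, steps=1000):
    """Sketch of LaCIM's test-time inference: sample initial latents, keep the
    most likely one under p(x|s, z), refine (s, z) by gradient steps on the
    pseudo-likelihood, then predict with the invariant p(y|s).
    """
    rng = np.random.default_rng(0)
    d = W.shape[1]
    # 1) sample candidate latents u = (s, z), keep the most likely one
    u0 = rng.normal(size=(n_init, d))
    u = u0[((u0 @ W.T - x) ** 2).sum(-1).argmin()].copy()
    # 2) gradient descent on ||W u - x||^2 (step size chosen for stability)
    lr = 1.0 / (2.0 * np.linalg.norm(W, 2) ** 2)
    for _ in range(steps):
        u -= lr * 2.0 * W.T @ (W @ u - x)
    # 3) feed only the causal part s into the invariant predictor p(y|s)
    s = u[:q_s]
    return int(np.argmax(predictor(s))), float(((W @ u - x) ** 2).sum())

rng = np.random.default_rng(1)
W = np.eye(8) + 0.1 * rng.normal(size=(8, 8))        # well-conditioned toy decoder
predictor = lambda s: np.array([s.sum(), -s.sum()])  # toy stand-in for p(y|s)
x = W @ np.concatenate([np.ones(4), np.zeros(4)])    # generated with s = 1, z = 0
pred, recon_err = infer_and_predict(x, W, predictor)
```

Because only $s$ enters the predictor, the domain-varying correlation carried by $z$ has no route into the prediction.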

4.2. IDENTIFIABILITY OF CAUSAL INVARIANT MECHANISMS

We present the identifiability claim for the CIMe $f_x, f_y$, which implicitly distinguishes the y-causative factors (a.k.a. $S$) from others (a.k.a. $Z$) for prediction, providing a theoretical guarantee for avoiding spurious correlations. Notably, $S$ and $Z$ play "asymmetric roles" in the generating process, as reflected in the additional generating flow from $S$ to $Y$. This "information intersection" property of $S$, i.e., $f_y^{-1}(\bar{y}) = [f_x^{-1}]_S(\bar{x})$ for any $(\bar{x}, \bar{y}) \in f_x(\mathcal{S} \times \mathcal{Z}) \times f_y(\mathcal{S})$ when $y = f_y(s) + \varepsilon_y$, is exploited to disentangle $S$ from $Z$. Such a disentanglement analysis is crucial to causal prediction but lacking in the existing literature on identifiability, such as works identifying discrete latent confounders (Janzing, Sgouritsa, Stegle, Peters and Schölkopf, 2012; Sgouritsa et al., 2013), works relying on the ANM assumption (Janzing, Peters, Mooij and Schölkopf, 2012), linear ICA (Eriksson and Koivunen, 2003), and nonlinear ICA (Khemakhem, Kingma and Hyvärinen, 2020; Khemakhem, Monti, Kingma and Hyvärinen, 2020; Teshima et al., 2020) (please refer to supplement 7.6 for a broader review). Besides, our analysis extends the scope of Khemakhem, Kingma and Hyvärinen (2020) to categorical $Y$ and to general forms of $P(S, Z|C=c)$ belonging to a Sobolev space, in theorem 4.4. Note that our analysis does NOT require observing the original source variable $d_s$. We first narrow our interest to a family of LaCIMs, denoted $\mathcal{P}_{exp}$, in which any $p \in \mathcal{P}_{exp}$ satisfies that (i) the $S, Z$ belong to the exponential family, and (ii) the $Y$ is generated by an ANM.
We will show later that $\mathcal{P}_{exp}$ can approximate any $P(S, Z|c) \in W^{r,2}(\mathcal{S} \times \mathcal{Z})$ for some $r \ge 2$:
$\mathcal{P}_{exp} = \big\{$LaCIM with $y = f_y(s) + \varepsilon_y$ and $p(s,z|c) := p_{T_z, \Gamma^z_c}(z|c)\, p_{T_s, \Gamma^s_c}(s|c)\big\}$, where, for $t = s, z$ and $e \in \mathcal{E}$,
$p_{T_t, \Gamma^t_c}(t) := \prod_{i=1}^{q_t} \exp\Big( \sum_{j=1}^{k_t} T^t_{i,j}(t_i)\,\Gamma^t_{c,i,j} + B_i(t_i) - A^t_{c,i} \Big)$,
in which $\{T^t_{i,j}(t_i)\}$, $\{\Gamma^t_{c,i,j}\}$ denote the sufficient statistics and natural parameters, and $\{B_i\}$, $\{A^t_{c,i}\}$ denote the base measures and normalizing constants ensuring that the distribution integrates to 1. Let $T^t(t) := [T^t_1(t_1), \dots, T^t_{q_t}(t_{q_t})]$ with $T^t_i(t_i) := [T^t_{i,1}(t_i), \dots, T^t_{i,k_t}(t_i)]$, and $\Gamma^t_c := [\Gamma^t_{c,1}, \dots, \Gamma^t_{c,q_t}] \in \mathbb{R}^{k_t \times q_t}$ with $\Gamma^t_{c,i} := [\Gamma^t_{c,i,1}, \dots, \Gamma^t_{c,i,k_t}]$, $\forall i \in [q_t]$. We define the $\sim_p$-identifiability for $\theta := \{f_x, f_y, T_s, T_z\}$ as:
Definition 4.2 ($\sim_p$-identifiability). We define a binary relation on the parameter space of $\mathcal{X} \times \mathcal{Y}$: $\theta \sim_p \tilde{\theta}$ if there exist two sets of permutation matrices and vectors, $(M_s, a_s)$ and $(M_z, a_z)$ for $s$ and $z$ respectively, such that for any $(x, y) \in \mathcal{X} \times \mathcal{Y}$:
$T_s([f_x^{-1}]_S(x)) = M_s \tilde{T}_s([\tilde{f}_x^{-1}]_S(x)) + a_s$, $\quad T_z([f_x^{-1}]_Z(x)) = M_z \tilde{T}_z([\tilde{f}_x^{-1}]_Z(x)) + a_z$, $\quad p_{f_y}(y|[f_x^{-1}]_S(x)) = p_{\tilde{f}_y}(y|[\tilde{f}_x^{-1}]_S(x))$.
We say that $\theta$ is $\sim_p$-identifiable if, for any $\tilde{\theta}$, $p^e_\theta(x,y) = p^e_{\tilde{\theta}}(x,y)$ $\forall e \in \mathcal{E}_{train}$ implies $\theta \sim_p \tilde{\theta}$. It can be shown that $\sim_p$ satisfies reflexivity ($\theta \sim_p \theta$), symmetry (if $\theta \sim_p \tilde{\theta}$ then $\tilde{\theta} \sim_p \theta$), and transitivity (if $\theta_1 \sim_p \theta_2$ and $\theta_2 \sim_p \theta_3$, then $\theta_1 \sim_p \theta_3$), and hence is an equivalence relation (details in supplement 7.2). This definition states that $S, Z$ can be identified up to permutation and point-wise transformation, which is sufficient for the disentanglement of $S$ and for identifying the predicting mechanism $p_{f_y}(y|[f_x^{-1}]_S(x))$. Specifically, the condition on $f_x$ implies the separation of $S$ and $Z$ except in the extreme case when $S$ can be represented by $Z$, i.e., when there exists a function $h: \mathcal{Z} \to \mathcal{S}$ such that $[f_x^{-1}]_S(x) = h([f_x^{-1}]_Z(x))$.
This definition is inspired by, but goes beyond, the unsupervised scenario considered in nonlinear ICA (Hyvärinen et al., 2019; Khemakhem, Kingma and Hyvärinen, 2020), in that it further distinguishes $S$ from $Z$. Besides, the condition $p_{f_y}(y|[f_x^{-1}]_S(x)) = p_{\tilde{f}_y}(y|[\tilde{f}_x^{-1}]_S(x))$ further guarantees the identifiability of prediction: predicting with $f_y(s)$, where $s$ is obtained from $f_x$. The following theorem presents the $\sim_p$-identifiability for $\mathcal{P}_{exp}$:
Theorem 4.3 ($\sim_p$-identifiability). For $\theta$ in a LaCIM with $p^e_\theta(x,y) \in \mathcal{P}_{exp}$ for any $e \in \mathcal{E}_{train}$, we assume that i) the CIMe satisfy that $f_x$, $f_y$ and $f_x^{-1}$ are continuous and that $f_x, f_y$ are bijective; ii) the $T^t_{i,j}$ are twice differentiable for any $t = s, z$, $i \in [q_t]$, $j \in [k_t]$; iii) the exogenous variables satisfy that the characteristic functions of $\varepsilon_x, \varepsilon_y$ are almost everywhere nonzero. Under the diversity condition on $A := [P_{d^{e_1}}, \dots, P_{d^{e_m}}] \in \mathbb{R}^{m \times R}$ with $P_{d^e} := [p(c_1|d^e), \dots, p(c_R|d^e)]$, namely that $A$ and $\big[(\Gamma^t_{c_2} - \Gamma^t_{c_1})^\top, \dots, (\Gamma^t_{c_R} - \Gamma^t_{c_1})^\top\big]^\top$ have full column rank for both $t = s$ and $t = z$, we have that $\theta := \{f_x, f_y, T_s, T_z\}$ is $\sim_p$-identifiable. The bijectivity of $f_x$ and $f_y$ has been widely assumed in Janzing et al. (2009); Peters et al. (2014; 2017); Khemakhem, Kingma and Hyvärinen (2020); Teshima et al. (2020) as a basic condition for identifiability. It naturally holds for $f_x$ to be bijective, since the latent components $S, Z$, as high-level abstractions which can be viewed as embeddings in an auto-encoder (Kramer, 1991), lie in a lower-dimensional space than the input $X$, which is supposed to have more variations, i.e., $q_s + q_z < q_x$. For categorical $Y$, the $f_y$ which generates the classification result, i.e., $p(y = k|s) = [f_y]_k(s) / \sum_k [f_y]_k(s)$, will be shown later to be identifiable.
The diversity condition implies that i) $m \ge R \ge \max(k_z q_z, k_s q_s) + 1$; and ii) different environments are variant enough in terms of the S-Z correlation (as also assumed in Arjovsky et al. (2019)), which is necessary for the invariant mechanisms to be identified. As noted in the formulation, a larger $m$ makes the condition easier to satisfy, which agrees with the intuition that more environments provide more complementary information for identifying the invariant mechanisms.
Remark 2. The dimensions of the ground-truth $S, Z$ are unknown, making it impossible to check whether $m$ is large enough. Besides, in some real applications the training environments are passively observed and may not satisfy the condition. However, we empirically find that LaCIM improves both OOD prediction and interpretability as long as the provided environments are diverse enough. Besides, a training environment can be a mixture of many sub-environments, which motivates splitting the data according to source IDs or clustering results (Teney et al., 2020) to obtain more environments, making the condition easier to satisfy.
Extension to the general form of LaCIM. In the following theorem, we generalize the identifiability result of theorem 4.3 to any LaCIM with $P(S, Z|C=c) \in W^{r,2}(\mathcal{S} \times \mathcal{Z})$ (for some $r \ge 2$) and categorical $Y$. This is accomplished by showing that any such LaCIM can be approximated by a sequence of distributions in $\mathcal{P}_{exp}$, motivated by the facts in Barron and Sheu (1991) that the exponential family is dense in the set of distributions with bounded support, and in Maddison et al. (2016) that a continuous variable with a multinomial logit model can be approximated by a series of distributions with i.i.d. Gumbel noise as the temperature converges to infinity.
Theorem 4.4 (Asymptotic $\sim_p$-identifiability). Consider a LaCIM such that $p_{f_x}(x|s,z)$ and $p_{f_y}(y|s)$ are smooth w.r.t. $s, z$ and $s$, respectively.
For each $e$ and $c \in \mathcal{C}$, suppose $P^e(S, Z|C=c) \in W^{r,2}(\mathcal{S} \times \mathcal{Z})$ for some $r \ge 2$. Then $P$ is asymptotically $\sim_p$-identifiable, defined as: $\forall \epsilon > 0$, there exists a $\sim_p$-identifiable $P_{\tilde{\theta}} \in \mathcal{P}_{exp}$ such that $d_{Prok}\big(p^e(x,y), p^e_{\tilde{\theta}}(x,y)\big) < \epsilon$ for all $e \in \mathcal{E}_{train}$ and $(x,y) \in \mathcal{X} \times \mathcal{Y}$, where $d_{Prok}$ denotes the Prokhorov metric.
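As a concrete instance of the exponential-family class $\mathcal{P}_{exp}$ used above (a standard textbook fact, added here for illustration and not part of the paper's exposition), a factorized conditional Gaussian prior fits the assumed form with $k_t = 2$ sufficient statistics per coordinate:

```latex
% Gaussian special case of p_{T_t, \Gamma^t_c}: for each coordinate t_i,
% p(t_i | c) = N(t_i; \mu_{c,i}, \sigma^2_{c,i}) matches the exponential form
% with sufficient statistics T_{i,1}(t_i) = t_i and T_{i,2}(t_i) = t_i^2:
p(t_i \mid c)
  = \exp\Big(
      \underbrace{t_i}_{T^t_{i,1}} \cdot
        \underbrace{\tfrac{\mu_{c,i}}{\sigma^2_{c,i}}}_{\Gamma^t_{c,i,1}}
      + \underbrace{t_i^2}_{T^t_{i,2}} \cdot
        \underbrace{\Big(-\tfrac{1}{2\sigma^2_{c,i}}\Big)}_{\Gamma^t_{c,i,2}}
      + \underbrace{0}_{B_i(t_i)}
      - \underbrace{\Big(\tfrac{\mu_{c,i}^2}{2\sigma^2_{c,i}}
          + \tfrac{1}{2}\log\!\big(2\pi\sigma^2_{c,i}\big)\Big)}_{A^t_{c,i}}
    \Big)
```

Expanding $\mathcal{N}(t_i; \mu, \sigma^2) = \tfrac{1}{\sqrt{2\pi\sigma^2}} \exp\big(-\tfrac{(t_i - \mu)^2}{2\sigma^2}\big)$ and collecting terms in $t_i$ and $t_i^2$ recovers exactly this decomposition, so Gaussian S,Z-priors (as used in the implementation of section 4.3) satisfy the assumptions of theorem 4.3.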

4.3. CAUSAL SUPERVISED VARIATIONAL AUTO-ENCODER

Guided by the identifiability results, we first present the training method that learns $f_x, f_y$ by reformulating the VAE in a supervised scenario, followed by optimization over the latent space for inference at test time.
Training. To learn the CIMe and $p_{f_x}(x|s,z), p_{f_y}(y|s)$ for invariant prediction, we fit a generative model to $\{p^e(x,y)\}_{e \in \mathcal{E}_{train}}$, which is guaranteed by theorems 4.3 and 4.4 to identify the ground-truth predicting mechanism. Specifically, we reformulate the objective of the VAE, a generative model proposed in (Kingma and Welling, 2014), in the supervised scenario. For unsupervised learning, the VAE introduces a variational distribution $q_\psi$, parameterized by $\psi$, to approximate the intractable posterior by maximizing the Evidence Lower Bound (ELBO): $-\mathcal{L}_{\phi,\psi} = \mathbb{E}_{p(x)}\big[\mathbb{E}_{q_\psi(z|x)}\big[\log \tfrac{p_\phi(x,z)}{q_\psi(z|x)}\big]\big]$, a tractable surrogate of the maximum likelihood $\mathbb{E}_{p(x)}[\log p_\phi(x)]$. Specifically, the ELBO is no greater than $\mathbb{E}_{p(x)}[\log p_\phi(x)]$, with equality achieved only when $q_\psi(z|x) = p_\phi(z|x)$. Therefore, maximizing the ELBO over $p_\phi$ and $q_\psi$ drives (i) $q_\psi(z|x)$ to learn $p_\phi(z|x)$, and (ii) $p_\phi$ to learn the ground-truth model $p$ (including $p_\phi(x|z)$ learning $p(x|z)$). In our supervised scenario, we introduce the variational distribution $q^e_\psi(s,z|x,y)$; the corresponding ELBO for any $e$ is $-\mathcal{L}^e_{\phi,\psi} = \mathbb{E}_{p^e(x,y)}\big[\mathbb{E}_{q^e_\psi(s,z|x,y)}\big[\log \tfrac{p^e_\phi(x,y,s,z)}{q^e_\psi(s,z|x,y)}\big]\big]$. Similarly, minimizing $\mathcal{L}^e_{\phi,\psi}$ drives $p_\phi(x|s,z), p_\phi(y|s)$ to learn the CIMe (i.e., $p_{f_x}(x|s,z), p_{f_y}(y|s)$), and $q^e_\psi(s,z|x,y)$ to learn $p^e_\phi(s,z|x,y)$. In other words, $q_\psi$ can inherit the properties of $p_\phi$. As $p^e_\phi(s,z|x,y) = \tfrac{p^e_\phi(s,z|x)\, p_\phi(y|s)}{p^e_\phi(y|x)}$ for our DAG in Fig. 1, we can similarly reparameterize $q^e_\psi(s,z|x,y)$ as $\tfrac{q^e_\psi(s,z|x)\, q_\psi(y|s)}{q^e_\psi(y|x)}$. According to the Causal Markov Condition, we have $p^e_\phi(x,y,s,z) = p_\phi(x|s,z)\, p^e_\phi(s,z)\, p_\phi(y|s)$.
Substituting the above reparameterizations into the ELBO, with $q_\psi(y|s)$ replaced by $p_\phi(y|s)$, the $\mathcal{L}^e_{\phi,\psi}$ can be rewritten as:
$\mathcal{L}^e_{\phi,\psi} = \mathbb{E}_{p^e(x,y)}\Big[-\log q^e_\psi(y|x) - \mathbb{E}_{q^e_\psi(s,z|x)\frac{p_\phi(y|s)}{q^e_\psi(y|x)}}\Big[\log \frac{p_\phi(x|s,z)\, p^e_\phi(s,z)}{q^e_\psi(s,z|x)}\Big]\Big]$, where $q^e_\psi(y|x) = \int_{\mathcal{S}} q^e_\psi(s|x)\, p_\phi(y|s)\, ds$. The overall loss function is $\mathcal{L}_{\phi,\psi} = \sum_{e \in \mathcal{E}_{train}} \mathcal{L}^e_{\phi,\psi}$, optimized over the inference model $q^e_\psi(s,z|x)$ and the generative models $p_\phi(x|s,z), p_\phi(y|s)$ in Eq. (2). The generative models $p_\phi(x|s,z), p_\phi(y|s)$ are shared among all environments, while the $p^e_\phi(s,z), q^e_\psi(s,z|x)$ are respectively $p_\phi(s,z|d^e_s), q_\psi(s,z|x,d^e_s)$ for LaCIM-$d_s$ and $p_\phi(s,z|d^e), q_\psi(s,z|x,d^e)$ for LaCIM-$d$.
Inference & Test. When $d^e_s$ can be acquired during test for $e \in \mathcal{E}_{test}$, we predict $y$ as $\arg\max_y p_\phi(y|x,d^e_s) = \arg\max_y \int q_\psi(s|x,d^e_s)\, p_\phi(y|s)\, ds$. Otherwise, for LaCIM-$d$ with $d_s$ unobserved, we first optimize $s, z$ via $(s^\star, z^\star) := \arg\max_{s,z} \log p_\phi(x|s,z)$ and predict $y$ as $\arg\max_y q_\psi(y|s^\star)$. Specifically, we adopt the optimization strategy of Schott et al. (2018): we first sample initial points and select the one with the maximum $\log p_\phi(x|s,z)$, then optimize for 50 iterations using Adam. The implementation details and optimization effect are shown in supplement 7.9.
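The per-sample reweighted objective above can be sketched with a small Monte-Carlo estimator. This is our own simplified illustration: diagonal-Gaussian $q(s,z|x)$ and prior, a unit-variance Gaussian decoder, and $q(y|x)$ estimated from the same Monte-Carlo samples; the function names, shapes and toy networks are assumptions, not the paper's implementation.

```python
import numpy as np

def lacim_negative_elbo(x, y, enc_mu, enc_logvar, decoder, predictor,
                        prior_mu, prior_logvar, q_s, n_mc=256, rng=None):
    """Monte-Carlo estimate of the per-sample loss
        -log q(y|x) - E_{q(s,z|x) p(y|s)/q(y|x)}[log p(x|s,z) p^e(s,z) / q(s,z|x)],
    where q(y|x) is the integral of q(s|x) p(y|s) over s (estimated by MC)."""
    rng = rng or np.random.default_rng(0)
    eps = rng.normal(size=(n_mc, enc_mu.shape[0]))
    u = enc_mu + np.exp(0.5 * enc_logvar) * eps            # u = (s, z) ~ q(s,z|x)
    s = u[:, :q_s]
    log_q = -0.5 * ((u - enc_mu) ** 2 / np.exp(enc_logvar)
                    + enc_logvar + np.log(2 * np.pi)).sum(-1)
    log_p_sz = -0.5 * ((u - prior_mu) ** 2 / np.exp(prior_logvar)
                       + prior_logvar + np.log(2 * np.pi)).sum(-1)
    log_p_x = -0.5 * ((decoder(u) - x) ** 2).sum(-1)       # Gaussian decoder, unit var
    p_y_given_s = predictor(s)[:, y]                       # p(y|s) at the observed y
    q_y_given_x = p_y_given_s.mean()                       # MC estimate of q(y|x)
    w = p_y_given_s / q_y_given_x                          # reweighting p(y|s)/q(y|x)
    return -np.log(q_y_given_x) - (w * (log_p_x + log_p_sz - log_q)).mean()

# toy usage: identity decoder on u = (s, z) with q_s = q_z = 2, softmax predictor
decoder = lambda u: u
def predictor(s):
    logits = np.stack([s.sum(-1), -s.sum(-1)], -1)
    e = np.exp(logits - logits.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)
loss = lacim_negative_elbo(x=np.zeros(4), y=0, enc_mu=np.zeros(4),
                           enc_logvar=np.zeros(4), decoder=decoder,
                           predictor=predictor, prior_mu=np.zeros(4),
                           prior_logvar=np.zeros(4), q_s=2)
```

Summing this quantity over samples from each environment, with environment-specific prior parameters, gives the overall loss summed over $e \in \mathcal{E}_{train}$.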

5. EXPERIMENTS

We evaluate LaCIM on (I) synthetic data, to verify the identifiability in theorem 4.3 (section 5.1); (II) real-world OOD challenges (section 5.2); and (III) robustness to small perturbations (section 5.3).

5.1. SIMULATION

To verify the identifiability claim and the effectiveness of our learning method, we implement LaCIM on synthetic data. The data-generating process is provided in supplement 7.8. The domain index $D \in \mathbb{R}^m$ is a one-hot encoded vector with $m = 5$. To verify the utility of training on multiple domains ($m > 1$), we also run LaCIM on data pooled from all $m$ domains, namely pool-LaCIM, for comparison. We randomly generate the $m = 5$ datasets and run 20 trials for each. We compute the mean correlation coefficient (MCC), adopted in Khemakhem, Kingma and Hyvärinen (2020), to measure the goodness of identifiability under permutation, by introducing cost optimization to assign each learned component to the source component. This measurement is aligned with the goal of $\sim_p$-identifiability, which allows us to distinguish $S$ from $Z$. Table 5.1 shows the superiority of our LaCIM-$d$ and LaCIM-$d_s$ over pool-LaCIM in recovering the CIMe relating to $S, Z$ under permutation, across multiple diverse experiments. Besides, we consider LaCIM-$d$ with $m = 3, 5, 7$ and the same total number of samples. The results show that more environments yield better performance, and that even $m = 3$ still performs much better than pool-LaCIM. To illustrate the learning effect, we visualize the learned $Z$ in Fig. 7.8, with $S$ left to supplement 7.8 due to space limit.
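The MCC metric used above can be sketched as follows. This is an illustrative reimplementation of the general idea, not the paper's evaluation code: it brute-forces the component assignment for small latent dimensions, whereas Khemakhem, Kingma and Hyvärinen (2020) use a linear-assignment solver.

```python
import itertools
import numpy as np

def mean_corr_coef(z_true, z_learned):
    """MCC under permutation (sketch): absolute correlations between every
    (ground-truth, learned) component pair, maximized over assignments of
    learned components to true ones (brute force, fine for small q)."""
    q = z_true.shape[1]
    # |corr| between every (true, learned) component pair: (q, q) cross block
    corr = np.abs(np.corrcoef(z_true.T, z_learned.T)[:q, q:])
    # best permutation of learned components onto true ones
    return max(np.mean([corr[i, p[i]] for i in range(q)])
               for p in itertools.permutations(range(q)))

# toy check: learned latents = permuted, sign-flipped, scaled truth + small noise,
# exactly the ambiguity that ~p-identifiability tolerates -> MCC stays near 1
rng = np.random.default_rng(0)
z = rng.normal(size=(2000, 3))
z_hat = np.stack([-2.0 * z[:, 2], 0.5 * z[:, 0], z[:, 1]], axis=1)
z_hat += 0.01 * rng.normal(size=z_hat.shape)
mcc = mean_corr_coef(z, z_hat)
mcc_random = mean_corr_coef(z, rng.normal(size=(2000, 3)))  # unrelated baseline
```

A high MCC thus certifies recovery up to exactly the permutation and point-wise transformations allowed by Def. 4.2.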

5.2. REAL-WORLD OOD CHALLENGE

We present our LaCIM's results on three OOD tasks, with different environments associated with different values of $d_s$. We implement both versions of LaCIM, i.e., LaCIM-$d_s$ and LaCIM-$d$, with a task-dependent definition of $d_s$. In CMNIST, the $d_s$ (digit color) is a fully observed confounder, and LaCIM-$d_s$ in this case is the ceiling of LaCIM-$d$ under the same implementation. In NICO and ADNI, where the source variables are only partially observed, LaCIM-$d$ even outperforms LaCIM-$d_s$.
Dataset. We describe the datasets as follows (the $X$ denotes the image; the $Y$ denotes the label). NICO: we evaluate cat/dog classification on the "Animal" dataset in NICO, a benchmark for non-i.i.d. problems (He et al., 2019). Each animal is associated with "grass" and "snow" contexts in different proportions, denoted as $d_s \in \mathbb{R}^4$ (cat, dog in grass, snow). We set $m = 8$ and $m = 14$. The $C, Z, S$ respectively denote the (time, weather) of sampling, the context, and the semantic shape of the cat/dog.

CMNIST:

We relabel the digits 0-4 and 5-9 as y = 0 and y = 1, based on MNIST. Then we color p_e (1 - p_e) of the images with y = 0 (y = 1) green, and color the others red. We set m = 2 with p_{e_1} = 0.9, p_{e_2} = 0.8. The d^e_s is p_e, which describes the intensity of the spurious correlation caused by color. We do not flip y with probability 25% as in Arjovsky et al. (2019), since doing so would make the digit correlated with, rather than causally related to, the label, which is beyond our scope. The Z, S respectively represent the color and the digit. The C can again denote the (time, weather) in which the painter draws the digit and picks the color, e.g., the painter tends to draw a red 0 more often than a green 1 on a sunny morning. ADNI. The data are obtained from the ADNI database; Y := {0, 1, 2}, with 0, 1, 2 respectively denoting AD, Mild Cognitive Impairment (MCI) and Normal Control (NC). The X is the structural Magnetic Resonance Image (sMRI). We set m = 2. We consider two types of d_s: Age and TAU (a biomarker, Humpel and Hochstrasser (2011)). The S (Z) denote the disease-related (-unrelated) brain regions. The C denotes the hormone level that can affect brain-structure development. Compared Baselines. We compare with (i) Cross-Entropy from X → Y (CE X → Y) and (ii) the domain-adversarial neural network (DANN) for domain adaptation (Ganin et al., 2016), among others. Implementation Details. For each domain e, we implement the reparameterization with ρ^e_s, ρ^e_z: s', z' = ρ^e_s(s), ρ^e_z(z), to transform p^e(s, z) into an isotropic Gaussian; the generative models are then correspondingly modified as {p_φ(x|(ρ^e_s)^{-1}(s'), (ρ^e_z)^{-1}(z')), p_φ(y|(ρ^e_s)^{-1}(s'))} according to the rule of change of variables.
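The CMNIST construction described above (relabeling 0-4/5-9 and coloring with environment-dependent probability p_e) can be sketched as follows. This is a hypothetical minimal illustration: digit images are abstracted as integer labels, and only the label/color assignment logic is shown:

```python
import random

def make_cmnist_env(digits, p_e, rng):
    """Relabel digits 0-4 as y=0 and 5-9 as y=1, then color a fraction
    p_e of the y=0 images green (and 1-p_e of the y=1 images green), so
    that color is spuriously correlated with the label with intensity p_e."""
    data = []
    for digit in digits:
        y = 0 if digit <= 4 else 1
        p_green = p_e if y == 0 else 1.0 - p_e
        color = "green" if rng.random() < p_green else "red"
        data.append((digit, y, color))
    return data

# Two training environments with p_{e1} = 0.9 and p_{e2} = 0.8
rng = random.Random(0)
env1 = make_cmnist_env(list(range(10)) * 100, 0.9, rng)
env2 = make_cmnist_env(list(range(10)) * 100, 0.8, rng)
```

Because p_e differs across environments while the digit-to-label mechanism stays fixed, the color-label correlation is spurious and the digit-label mechanism is invariant.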
The optimized parameters are {{q^e_ψ(s, z|x)}_e, p_φ(x|s, z), p_φ(y|s), {ρ^e_{t=s,z}}_e}, with the encoder q^e_ψ(s, z|x) sequentially composed of: i) Conv-BN-ReLU-MaxPool blocks shared across E_train, followed by ii) ReLU-FC layers for the mean and log-variance of S, Z that are specific to e. The structure of ρ^e_{t=s,z} is FC-ReLU-FC. The decoder p_φ(x|s, z) is a sequence of upsampling, several TConv-BN-ReLU blocks, and a Sigmoid. The predictor p_φ(y|s) is a sequence of FC-BN-ReLU blocks, followed by a Softmax (or Sigmoid) for classification. The network structures and output channel sizes for CMNIST, NICO and ADNI are introduced in supplements 7.11, 7.12, 7.13 and Tabs. 13, 14. We use SGD as the optimizer: learning rate (lr) 0.5 and weight decay (wd) 1e-5 for CMNIST; lr 0.01 decayed by 0.2× every 60 epochs, with wd 5e-5 for NICO and wd 2e-4 for ADNI. The batch sizes are set to 256, 30 and 4 for CMNIST, NICO and ADNI, respectively. "FC" and "BN" stand for Fully-Connected and Batch-Normalization. Results. We report accuracy over three runs for each method. As shown in Tab. 2, our LaCIM-d performs comparably to or better than the others on all applications, except for the 99.3 achieved by SDA on CMNIST, which is comparable to the result on the original MNIST. This is because SDA performs data augmentation with random colors during training, which decorrelates the color from the label; when S cannot be explicitly extracted, as in the general case, SDA is not applicable. Discussions. The advantage over the invariant-learning method (IRM) and CE (X, d_s) → Y, which also takes d_s into prediction, can be attributed to the identification of the true causal mechanisms. Further, the improvement over sVAE benefits from our separation of the y-causative factors (a.k.a. S) from the others, which avoids spurious correlation.
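The change-of-variables step behind the reparameterization ρ^e described in the implementation details can be illustrated with a 1-D toy sketch. The assumptions here are hypothetical: ρ is an affine map rather than the learned FC-ReLU-FC network, and the prior p^e(s) is a univariate Gaussian, so that the pushforward density can be checked in closed form:

```python
from math import exp, pi, sqrt

def gauss_pdf(x, mu=0.0, sigma=1.0):
    return exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * sqrt(2 * pi))

# Hypothetical affine reparameterization rho(s) = (s - mu) / sigma, which
# maps p^e(s) = N(mu, sigma^2) onto the isotropic (standard) Gaussian.
mu, sigma = 2.0, 0.5
rho = lambda s: (s - mu) / sigma
rho_inv = lambda t: mu + sigma * t

def pushforward_pdf(t):
    # Rule of change of variables: p'(t) = p(rho_inv(t)) * |d rho_inv / dt|
    return gauss_pdf(rho_inv(t), mu, sigma) * sigma
```

For this affine choice, `pushforward_pdf` coincides with the standard Gaussian density, which is exactly the property the learned ρ^e is trained to achieve for the domain-specific prior p^e(s, z).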
Besides, as shown by the results on NICO, a larger m (with the total number of samples n fixed) brings further benefit, which may be due to the easier satisfaction of the diversity condition in theorem 4.3. One thing worth particular mention is that on NICO and ADNI (when d_s denotes TAU), our LaCIM-d performs comparably to or even better than LaCIM-d_s, because the source variables are only partially observed; for example, each d_s only contains one attribute at a time in ADNI. For completeness, we conduct experiments with fully observed confounders in supplement 7.13. Besides, we apply our method to intervened data; the result further validates the robustness of LaCIM, as shown in supplement 7.12. Interpretability. We visualize the learned S as side evidence of interpretability. Specifically, we select the s* that has the highest correlation with y among all dimensions of S, and visualize the derivatives of s* with respect to the image. For CE x → y and CE (x, d_s) → y, we visualize the derivatives of the predicted class scores with respect to the image. As shown in Fig. 5.2, LaCIM (the 4th column) identifies more explainable semantic features, which verifies the identifiability and the effectiveness of the learning method. Supplement 7.12 provides more results. Security. We consider the DeepFake-related security problem, which targets the detection of slightly perturbed fake images that can spread fake news. Rossler et al. (2019) provide the FaceForensics++ dataset, with training data from 1,000 YouTube videos and 1,000 benchmark images from other sources (OOD) for testing. We split the training data into m = 2 environments according to video ID. The competitive result in Tab. 5.3 verifies the potential value on security.

6. CONCLUSIONS & DISCUSSIONS

We incorporate the causal structure as prior knowledge in the proposed LaCIM by introducing: (i) latent variables, explicitly separated into y-causative factors (a.k.a. S) and others (a.k.a. Z) that are spuriously correlated with the output; and (ii) the source variable d_s that explains the distributional inconsistency among domains. When the environments are diverse enough, we can identify the causal invariant mechanisms, and also the y-causative factors for prediction without mixing in the others. Our LaCIM shows potential value for OOD tasks with confounding bias or selection bias, including healthcare and security. A possible drawback of our model lies in the required number of environments for identifiability (which may not be satisfied in some scenarios); its relaxation is left to future work.

7.1. PROOF OF THEOREM 7.1

Theorem 7.1. Suppose (among the regularity conditions referenced in the proof below) that E_{p^{e_1}} |π_x(S) g(S) - μ_1| < ∞, with μ_1 := E_{p^{e_1}}[g(S)|X = x] = ∫_S g(s) p^{e_1}(s|x) ds; then we have

|E_{p^{e_1}}(y|x) - E_{p^{e_2}}(y|x)| ≤ ‖g'‖_∞ ‖π'_x‖_∞ Var_{p^{e_1}}(S|X = x).

When e_1 ∈ E_train and e_2 ∈ E_test, theorem 7.1 describes the generalization error on e_2 of a strategy trained on e_1. The bound is mainly affected by: (i) the Lipschitz constant of g, i.e., ‖g'‖_∞; (ii) ‖π'_x‖_∞, which measures the difference between p^{e_1}(s, z) and p^{e_2}(s, z); and (iii) Var_{p^{e_1}}(S|x), which measures the stochasticity of generating x from (s, z). These terms can be roughly categorized into two classes: (i) and (iii), which relate to properties of the causal invariant mechanisms and leave little room for improvement; and (ii), which describes the distributional change between the two environments. Specifically, for the first class, (i) measures the smoothness of E(y|s) with respect to s: a smaller ‖g'‖_∞ implies flatter regions that give rise to the same prediction result, hence an easier transfer from e_1 to e_2, and vice versa.
For term (iii), consider the deterministic setting ε_x = 0 (which leads to Var_{p^{e_1}}(S|x) = 0): then s can be determined from x for generalization if f is a bijective function. Term (ii) measures the distributional change between the posteriors p^{e_1}(s|x) and p^{e_2}(s|x), which contributes to the difference in prediction: E_{p^{e_1}}(y|x) - E_{p^{e_2}}(y|x) = ∫_S (p^{e_1}(s|x) - p^{e_2}(s|x)) p_{f_y}(y|s) ds. Such a change is due to the inconsistency between the priors p^{e_1}(s, z) and p^{e_2}(s, z), which is caused by different values of the confounder d_s. Proof. In the following, we derive the upper bound |E_{p^{e_1}}[Y|X = x] - E_{p^{e_2}}[Y|X = x]| ≤ ‖g'‖_∞ ‖π'_x‖_∞ Var_{p^{e_1}}(S|X = x), where π_x(s) := p^{e_2}(s|x)/p^{e_1}(s|x) and g(s) is assumed to be Lipschitz continuous. To begin with, note that E[Y|X] = E[E(Y|X, S)|X] = E[g(S)|X] = ∫ g(s) p(s|x) ds. Let p_1(s|x) = p^{e_1}(s|x) and p_2(s|x) = p^{e_2}(s|x). For ease of notation, we use P_1 and P_2 to denote the distributions with densities p_1(s|x) and p_2(s|x), and suppose S_1 ∼ P_1 and S_2 ∼ P_2, where x is omitted as the following analysis is conditional on a fixed X = x. Then we may rewrite the difference of conditional expectations as E_{p^{e_2}}[Y|X = x] - E_{p^{e_1}}[Y|X = x] = E(g(S_2)) - E(g(S_1)), where E[g(S_j)] = ∫ g(s) p_j(s|x) ds denotes the expectation over P_j. Let μ_1 := E_{p^{e_1}}[g(S)|X = x] = E[g(S_1)] = ∫ g(s) p_1(s|x) ds. Then E_{p^{e_2}}[Y|X = x] - E_{p^{e_1}}[Y|X = x] = E(g(S_2)) - E(g(S_1)) = E[g(S_2) - μ_1]. Further, we have the transformation

E[g(S_2) - μ_1] = ∫ (g(s) - μ_1) π_x(s) p_1(s|x) ds = E[(g(S_1) - μ_1) π_x(S_1)]. (3)

In the following, we use results on the Stein kernel function; please refer to Definition 7.2 for the general definition. In particular, for the distribution P_1 with density p_1(s|x), the Stein kernel τ_1(s) is

τ_1(s) = (1/p_1(s|x)) ∫_{-∞}^{s} (E(S_1) - t) p_1(t|x) dt, where E(S_1) = ∫ s p_1(s|x) ds.
Further, we define (τ_1 ∘ g)(s) as

(τ_1 ∘ g)(s) = (1/p_1(s|x)) ∫_{-∞}^{s} (E(g(S_1)) - g(t)) p_1(t|x) dt = (1/p_1(s|x)) ∫_{-∞}^{s} (μ_1 - g(t)) p_1(t|x) dt. (5)

Under the second condition listed in Theorem 7.1, we may apply the result of Lemma 7.3. Specifically, by equation (8) with h = g and φ = π_x, we have E[(g(S_1) - μ_1) π_x(S_1)] = E[(τ_1 ∘ g)(S_1) π'_x(S_1)]. Then, under the first condition in Theorem 7.1, we obtain by Lemma 7.4:

E[(τ_1 ∘ g)(S_1) π'_x(S_1)] = E[((τ_1 ∘ g)/τ_1)(S_1) · (π'_x τ_1)(S_1)] ≤ E[|((τ_1 ∘ g)/τ_1)(S_1)| · |(π'_x τ_1)(S_1)|] ≤ ‖g'‖_∞ E[|(π'_x τ_1)(S_1)|] ≤ ‖g'‖_∞ ‖π'_x‖_∞ E[|τ_1(S_1)|]. (6)

In the following, we show that the Stein kernel is non-negative, which yields E[|τ_1(S_1)|] = E[τ_1(S_1)]. According to the definition, τ_1(s) = (1/p_1(s|x)) ∫_{-∞}^{s} (E(S_1) - t) p_1(t|x) dt, where E(S_1) = ∫_{-∞}^{∞} t p_1(t|x) dt. Let F_1(s) = ∫_{-∞}^{s} p_1(t|x) dt be the distribution function of P_1. Note that ∫_{-∞}^{s} E(S_1) p_1(t|x) dt = F_1(s) E(S_1), and ∫_{-∞}^{s} t p_1(t|x) dt = F_1(s) ∫_{-∞}^{s} t (p_1(t|x)/F_1(s)) dt = F_1(s) E(S_1|S_1 ≤ s) ≤ F_1(s) E(S_1). The last inequality is based on E(S_1|S_1 ≤ s) - E(S_1) ≤ 0, which can be proved as follows:

∫_{-∞}^{s} t (p_1(t|x)/F_1(s)) dt - ∫_{-∞}^{∞} t p_1(t|x) dt = ∫_{-∞}^{s} t (1/F_1(s) - 1) p_1(t|x) dt - ∫_{s}^{∞} t p_1(t|x) dt ≤ s ∫_{-∞}^{s} (1/F_1(s) - 1) p_1(t|x) dt - s ∫_{s}^{∞} p_1(t|x) dt = 0.

Therefore τ_1(s) ≥ 0 and hence E[|τ_1(S_1)|] = E[τ_1(S_1)] in (6). Besides, by equation (9), the special case of Lemma 7.3, we have E[τ_1(S_1)] = Var(S_1) = Var_{p^{e_1}}(S|X = x). To sum up,

E[(τ_1 ∘ g)(S_1) π'_x(S_1)] ≤ ‖g'‖_∞ ‖π'_x‖_∞ E[τ_1(S_1)] = ‖g'‖_∞ ‖π'_x‖_∞ Var_{p^{e_1}}(S|X = x).

Definition 7.2 (the Stein kernel τ_P of a distribution P). Suppose X ∼ P with density p. The Stein kernel of P is the function x ↦ τ_P(x) defined by

τ_P(x) = (1/p(x)) ∫_{-∞}^{x} (E(X) - y) p(y) dy, (7)

so that τ_P = τ_P ∘ Id for the identity function Id(x) = x. More generally, for a function h satisfying E[|h(X)|] < ∞, define (τ_P ∘ h)(x) as

(τ_P ∘ h)(x) = (1/p(x)) ∫_{-∞}^{x} (E(h(X)) - h(y)) p(y) dy.
Lemma 7.3. For a differentiable function φ such that E[|(τ_P ∘ h)(X) φ'(X)|] < ∞, we have

E[(τ_P ∘ h)(X) φ'(X)] = E[(h(X) - E(h(X))) φ(X)]. (8)

Proof. Let μ_h := E(h(X)). Since E(h(X) - μ_h) = 0,

(τ_P ∘ h)(x) = (1/p(x)) ∫_{-∞}^{x} (μ_h - h(y)) p(y) dy = -(1/p(x)) ∫_{x}^{∞} (μ_h - h(y)) p(y) dy.

Then

E[(τ_P ∘ h)(X) φ'(X)] = ∫_{-∞}^{0} (τ_P ∘ h)(x) φ'(x) p(x) dx + ∫_{0}^{∞} (τ_P ∘ h)(x) φ'(x) p(x) dx
= ∫_{-∞}^{0} ∫_{-∞}^{x} (μ_h - h(y)) p(y) φ'(x) dy dx - ∫_{0}^{∞} ∫_{x}^{∞} (μ_h - h(y)) p(y) φ'(x) dy dx
= ∫_{-∞}^{0} ∫_{y}^{0} (μ_h - h(y)) p(y) φ'(x) dx dy - ∫_{0}^{∞} ∫_{0}^{y} (μ_h - h(y)) p(y) φ'(x) dx dy
= ∫_{-∞}^{0} ∫_{0}^{y} (h(y) - μ_h) p(y) φ'(x) dx dy + ∫_{0}^{∞} ∫_{0}^{y} (h(y) - μ_h) p(y) φ'(x) dx dy
= ∫_{-∞}^{∞} (h(y) - μ_h) p(y) [∫_{0}^{y} φ'(x) dx] dy = ∫_{-∞}^{∞} (h(y) - μ_h) p(y) (φ(y) - φ(0)) dy
= ∫_{-∞}^{∞} (h(y) - μ_h) p(y) φ(y) dy = E[(h(X) - E(h(X))) φ(X)].

In particular, taking h(X) = X and φ(X) = X - E(X), we immediately have

E(τ_P(X)) = Var(X). (9)

Lemma 7.4. Assume that E(|X|) < ∞, the density p is locally absolutely continuous on (-∞, ∞), and h is a Lipschitz-continuous function. Then |f_h| ≤ ‖h'‖_∞ for

f_h(x) = (τ_P ∘ h)(x)/τ_P(x) = [∫_{-∞}^{x} (E(h(X)) - h(y)) p(y) dy] / [∫_{-∞}^{x} (E(X) - y) p(y) dy].

Proof. This is a special case of Corollary 3.15 in Döbler et al. (2015), taking the constant c = 1.
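The generalization bound of Theorem 7.1 can be checked numerically in a toy setting. The assumptions here are hypothetical: the two posteriors are taken as Gaussians, g is tanh (so its Lipschitz constant is 1), expectations are replaced by grid integration, and the sup-norm of the derivative of the density ratio is estimated by finite differences:

```python
from math import exp, pi, sqrt, tanh

def npdf(s, mu, sigma):
    return exp(-0.5 * ((s - mu) / sigma) ** 2) / (sigma * sqrt(2 * pi))

p1 = lambda s: npdf(s, 0.0, 1.0)   # toy posterior p^{e1}(s|x)
p2 = lambda s: npdf(s, 0.0, 0.8)   # toy posterior p^{e2}(s|x)
g = lambda s: tanh(s - 0.5)        # g(s) = E[y|s], Lipschitz constant 1

ds = 1e-3
grid = [-8.0 + ds * i for i in range(16001)]

# Left-hand side: |E_{p1}[g(S)] - E_{p2}[g(S)]| via Riemann sums
E1 = sum(g(s) * p1(s) for s in grid) * ds
E2 = sum(g(s) * p2(s) for s in grid) * ds
lhs = abs(E1 - E2)

# Right-hand side: ||g'||_inf * ||pi_x'||_inf * Var_{p1}(S|x)
pi_x = lambda s: p2(s) / p1(s)     # density ratio pi_x(s)
pi_prime_inf = max(abs(pi_x(s + ds) - pi_x(s - ds)) / (2 * ds) for s in grid)
var1 = sum(s * s * p1(s) for s in grid) * ds   # mean of p1 is 0
rhs = 1.0 * pi_prime_inf * var1
```

In this example the ratio p2/p1 is bounded with a bounded derivative (the lighter-tailed density sits on top), so the bound is finite and the inequality can be verified directly.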

7.2. PROOF OF THE EQUIVALENCE OF DEFINITION 4.2

Proposition 7.5. The binary relation ∼_p defined in Def. 4.2 is an equivalence relation. Proof. An equivalence relation must satisfy the following three properties: • Reflexivity: θ ∼_p θ, with M_z, M_s being identity matrices and a_s, a_z being 0. • Symmetry: if θ̃ ∼_p θ, then there exist block permutation matrices M_z and M_s such that T^s([f_x]^{-1}_S(x)) = M_s T̃^s([f̃_x]^{-1}_S(x)) + a_s, T^z([f_x]^{-1}_Z(x)) = M_z T̃^z([f̃_x]^{-1}_Z(x)) + a_z, and p_{f_y}(y|[f_x]^{-1}_S(x)) = p_{f̃_y}(y|[f̃_x]^{-1}_S(x)). Then M_s^{-1} and M_z^{-1} are also block permutation matrices, and T̃^s([f̃_x]^{-1}_S(x)) = M_s^{-1} T^s([f_x]^{-1}_S(x)) + (-M_s^{-1} a_s), T̃^z([f̃_x]^{-1}_Z(x)) = M_z^{-1} T^z([f_x]^{-1}_Z(x)) + (-M_z^{-1} a_z), and p_{f̃_y}(y|[f̃_x]^{-1}_S(x)) = p_{f_y}(y|[f_x]^{-1}_S(x)). Therefore, θ ∼_p θ̃. • Transitivity: if θ_1 ∼_p θ_2 and θ_2 ∼_p θ_3 with θ_i := {f^i_x, f^i_y, T^{s,i}, T^{z,i}, Γ^{s,i}, Γ^{z,i}}, then we have T^{s,1}((f^1_{x,s})^{-1}(x)) = M^1_s T^{s,2}((f^2_{x,s})^{-1}(x)) + a^1_s, T^{z,1}((f^1_{x,z})^{-1}(x)) = M^1_z T^{z,2}((f^2_{x,z})^{-1}(x)) + a^1_z, T^{s,2}((f^2_{x,s})^{-1}(x)) = M^2_s T^{s,3}((f^3_{x,s})^{-1}(x)) + a^2_s, and T^{z,2}((f^2_{x,z})^{-1}(x)) = M^2_z T^{z,3}((f^3_{x,z})^{-1}(x)) + a^2_z, for block permutation matrices M^1_s, M^1_z, M^2_s, M^2_z and vectors a^1_s, a^2_s, a^1_z, a^2_z. Then we have T^{s,1}((f^1_{x,s})^{-1}(x)) = M^1_s M^2_s T^{s,3}((f^3_{x,s})^{-1}(x)) + (M^1_s a^2_s + a^1_s) and T^{z,1}((f^1_{x,z})^{-1}(x)) = M^1_z M^2_z T^{z,3}((f^3_{x,z})^{-1}(x)) + (M^1_z a^2_z + a^1_z). Besides, it is apparent that p_{f^1_y}(y|(f^1_x)^{-1}_s(x)) = p_{f^2_y}(y|(f^2_x)^{-1}_s(x)) = p_{f^3_y}(y|(f^3_x)^{-1}_s(x)). Therefore, θ_1 ∼_p θ_3, since M^1_s M^2_s and M^1_z M^2_z are also block permutation matrices. With the above three properties satisfied, ∼_p is an equivalence relation.
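The matrix facts used in the symmetry and transitivity arguments above (the inverse of a permutation matrix is its transpose and is again a permutation matrix, and products of permutation matrices are permutation matrices) can be sanity-checked with a small pure-Python sketch:

```python
def is_permutation_matrix(M):
    # Each row and each column must contain exactly one 1 and zeros elsewhere.
    n = len(M)
    ok_rows = all(sorted(row) == [0] * (n - 1) + [1] for row in M)
    cols = [[M[i][j] for i in range(n)] for j in range(n)]
    ok_cols = all(sorted(col) == [0] * (n - 1) + [1] for col in cols)
    return ok_rows and ok_cols

def matmul(A, B):
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def transpose(M):
    n = len(M)
    return [[M[j][i] for j in range(n)] for i in range(n)]

P1 = [[0, 1, 0], [0, 0, 1], [1, 0, 0]]
P2 = [[0, 0, 1], [1, 0, 0], [0, 1, 0]]
# Product of two permutation matrices and the transpose (= inverse) of a
# permutation matrix are again permutation matrices.
```

The same closure properties hold blockwise for the block permutation matrices M_s, M_z in Def. 4.2, which is what the reflexivity/symmetry/transitivity steps rely on.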

7.3. PROOF OF THEOREM 4.3

In the following, we write p^e(x, y) as p(x, y|d^e), and also Γ^{t=s,z}_c := Γ^{t=s,z}(d^e), S_{c,i} = S_i(d^e), Z_{c,i} = Z_i(d^e). To prove theorem 4.3, we first prove theorem 7.6 for the simplest case when c|d^e = d^e; we then generalize to the case when C := ∪_r C_r. The overall roadmap is as follows: we first prove ∼_A-identifiability in theorem 7.9, whose combination with lemmas 7.11 and 7.12 gives theorem 7.6 in the simplest case c|d^e = d^e; we then generalize theorem 7.6 to the more general case C := ∪_r C_r. Theorem 7.6 (∼_p-identifiability). For θ in the LaCIM with p^e_θ(x, y) ∈ P_exp for any e ∈ E_train, we assume that: (1) the causal invariant mechanisms satisfy that f_x, f_y are continuous and bijective; (2) the T^t_{i,j} are twice differentiable for any t = s, z, i ∈ [q_t], j ∈ [k_t]; (3) the exogenous variables satisfy that the characteristic functions of ε_x, ε_y are almost everywhere nonzero; (4) the number of environments m ≥ max(q_s k_s, q_z k_z) + 1 and [[Γ^t(d^{e_2}) - Γ^t(d^{e_1})]^⊤, ..., [Γ^t(d^{e_m}) - Γ^t(d^{e_1})]^⊤]^⊤ has full column rank for both t = s and t = z. Then the parameters θ := {f_x, f_y, T^s, T^z} are ∼_p-identifiable. To prove theorem 7.6, we first prove the ∼_A-identifiability defined as follows: Definition 7.7 (∼_A-identifiability). The definition is the same as in Def. 4.2, with M_s, M_z being invertible matrices that are not necessarily the permutation matrices of Def. 4.2. Proposition 7.8. The binary relation ∼_A defined in Def. 7.7 is an equivalence relation. Proof. The proof is similar to that of proposition 7.5. The following theorem states that any LaCIM belonging to P_exp is ∼_A-identifiable. Theorem 7.9 (∼_A-identifiability).
For θ in the LaCIM with p^e_θ(x, y) ∈ P_exp for any e ∈ E_train, we assume that: (1) the causal invariant mechanisms satisfy that f_x, f_y are bijective; (2) the T^t_{i,j} are twice differentiable for any t = s, z, i ∈ [q_t], j ∈ [k_t]; (3) the exogenous variables satisfy that the characteristic functions of ε_x, ε_y are almost everywhere nonzero; (4) the number of environments m ≥ max(q_s k_s, q_z k_z) + 1 and [[Γ^t(d^{e_2}) - Γ^t(d^{e_1})]^⊤, ..., [Γ^t(d^{e_m}) - Γ^t(d^{e_1})]^⊤]^⊤ has full column rank for t = s, z. Then the parameters {f_x, f_y, T^s, T^z} are ∼_A-identifiable. Proof. Suppose that θ = {f_x, f_y, T^s, T^z} and θ̃ = {f̃_x, f̃_y, T̃^s, T̃^z} share the same observational distribution for each environment e ∈ E_train, i.e.,

p_{f_x,f_y,T^s,Γ^s,T^z,Γ^z}(x, y|d^e) = p_{f̃_x,f̃_y,T̃^s,Γ̃^s,T̃^z,Γ̃^z}(x, y|d^e). (11)

Marginalizing both sides over y and substituting the generating process x = f_x(s, z) + ε_x, we have

∫_X p_{ε_x}(x - x̄) p_{T^s,Γ^s,T^z,Γ^z}(f_x^{-1}(x̄)|d^e) volJ_{f_x^{-1}}(x̄) dx̄ (14)
= ∫_X p_{ε_x}(x - x̄) p_{T̃^s,Γ̃^s,T̃^z,Γ̃^z}(f̃_x^{-1}(x̄)|d^e) volJ_{f̃_x^{-1}}(x̄) dx̄ (15)
⟹ ∫_X p̃_{T^s,Γ^s,T^z,Γ^z,f_x}(x̄|d^e) p_{ε_x}(x - x̄) dx̄ = ∫_X p̃_{T̃^s,Γ̃^s,T̃^z,Γ̃^z,f̃_x}(x̄|d^e) p_{ε_x}(x - x̄) dx̄ (16)
⟹ (p̃_{T^s,Γ^s,T^z,Γ^z,f_x} * p_{ε_x})(x) = (p̃_{T̃^s,Γ̃^s,T̃^z,Γ̃^z,f̃_x} * p_{ε_x})(x) (17)
⟹ F[p̃_{T^s,Γ^s,T^z,Γ^z,f_x}](ω) φ_{ε_x}(ω) = F[p̃_{T̃^s,Γ̃^s,T̃^z,Γ̃^z,f̃_x}](ω) φ_{ε_x}(ω) (18)
⟹ F[p̃_{T^s,Γ^s,T^z,Γ^z,f_x}](ω) = F[p̃_{T̃^s,Γ̃^s,T̃^z,Γ̃^z,f̃_x}](ω) (19)
⟹ p̃_{T^s,Γ^s,T^z,Γ^z,f_x}(x|d^e) = p̃_{T̃^s,Γ̃^s,T̃^z,Γ̃^z,f̃_x}(x|d^e), (20)

where volJ_f(X) := |det(J_f(X))| for any square matrix X and function f, with "J" standing for the Jacobian. The p̃_{T^s,Γ^s,T^z,Γ^z,f_x}(x̄|d^e) in Eq. (16) denotes p_{T^s,Γ^s,T^z,Γ^z}(f_x^{-1}(x̄)|d^e) volJ_{f_x^{-1}}(x̄). The '*' in Eq. (17) denotes the convolution operator. The F[·] in Eq. (18) denotes the Fourier transform, where φ_{ε_x}(ω) = F[p_{ε_x}](ω). Since we assume that φ_{ε_x}(ω) is nonzero almost everywhere, we can drop it to obtain Eq. (20).
Similarly, for the y-marginal p_{f_y,T^s,Γ^s}(y|d^e), and for the joint variable v := [x^⊤, y^⊤]^⊤ with ε := [ε_x^⊤, ε_y^⊤]^⊤ and h(v) := [([f_x]^{-1}_Z(x))^⊤, (f_y^{-1}(y))^⊤]^⊤, the same argument yields

∫_V p_ε(v - v̄) p_{T^s,Γ^s,T^z,Γ^z}(h^{-1}(v̄)|d^e) volJ_{h^{-1}}(v̄) dv̄ (32)
= ∫_V p_ε(v - v̄) p_{T̃^s,Γ̃^s,T̃^z,Γ̃^z}(h̃^{-1}(v̄)|d^e) volJ_{h̃^{-1}}(v̄) dv̄ (33)
⟹ ∫_{S×Z} p̃_{T^s,Γ^s,T^z,Γ^z,h}(v̄|d^e) p_ε(v - v̄) dv̄ = ∫_{S×Z} p̃_{T̃^s,Γ̃^s,T̃^z,Γ̃^z,h̃}(v̄|d^e) p_ε(v - v̄) dv̄ (34)
⟹ (p̃_{T^s,Γ^s,T^z,Γ^z,h} * p_ε)(v) = (p̃_{T̃^s,Γ̃^s,T̃^z,Γ̃^z,h̃} * p_ε)(v) (35)
⟹ F[p̃_{T^s,Γ^s,T^z,Γ^z,h}](ω) φ_ε(ω) = F[p̃_{T̃^s,Γ̃^s,T̃^z,Γ̃^z,h̃}](ω) φ_ε(ω) (36)
⟹ F[p̃_{T^s,Γ^s,T^z,Γ^z,h}](ω) = F[p̃_{T̃^s,Γ̃^s,T̃^z,Γ̃^z,h̃}](ω) (37)
⟹ p̃_{T^s,Γ^s,T^z,Γ^z,h}(v) = p̃_{T̃^s,Γ̃^s,T̃^z,Γ̃^z,h̃}(v). (38)

According to Eq. (29), we have

log volJ_{f_y^{-1}}(y) + Σ_{i=1}^{q_s} [log B_i(f_{y,i}^{-1}(y)) - log A_i(d^e) + Σ_{j=1}^{k_s} T^s_{i,j}(f_{y,i}^{-1}(y)) Γ^s_{i,j}(d^e)]
= log volJ_{f̃_y^{-1}}(y) + Σ_{i=1}^{q_s} [log B̃_i(f̃_{y,i}^{-1}(y)) - log Ã_i(d^e) + Σ_{j=1}^{k_s} T̃^s_{i,j}(f̃_{y,i}^{-1}(y)) Γ̃^s_{i,j}(d^e)]. (39)

Suppose that assumption (4) holds; subtracting the equation for d^{e_1} from the equation for d^{e_k} then gives

⟨T^s(f_y^{-1}(y)), Γ̄^s(d^{e_k})⟩ + Σ_i log [A_i(d^{e_1})/A_i(d^{e_k})] = ⟨T̃^s(f̃_y^{-1}(y)), Γ̄̃^s(d^{e_k})⟩ + Σ_i log [Ã_i(d^{e_1})/Ã_i(d^{e_k})] (40)

for all k ∈ [m], where Γ̄(d) := Γ(d) - Γ(d^{e_1}). Denote b̄_s(k) := Σ_i log [Ã_i(d^{e_1}) A_i(d^{e_k}) / (Ã_i(d^{e_k}) A_i(d^{e_1}))] for k ∈ [m]; then we have

Γ̄^{s,⊤} T^s(f_y^{-1}(y)) = Γ̄̃^{s,⊤} T̃^s(f̃_y^{-1}(y)) + b̄_s. (41)

Similarly, from Eq. (20) and Eq. (38), there exist b̄_z, b̄_s such that

Γ̄^{s,⊤} T^s([f_x]^{-1}_S(x)) + Γ̄^{z,⊤} T^z([f_x]^{-1}_Z(x)) = Γ̄̃^{s,⊤} T̃^s([f̃_x]^{-1}_S(x)) + Γ̄̃^{z,⊤} T̃^z([f̃_x]^{-1}_Z(x)) + b̄_z + b̄_s, (42)

where b̄_z(k) is defined analogously to b̄_s(k) with the normalizing terms of the Z-part; and

Γ̄^{s,⊤} T^s(f_y^{-1}(y)) + Γ̄^{z,⊤} T^z([f_x]^{-1}_Z(x)) = Γ̄̃^{s,⊤} T̃^s(f̃_y^{-1}(y)) + Γ̄̃^{z,⊤} T̃^z([f̃_x]^{-1}_Z(x)) + b̄_z + b̄_s. (43)

Substituting Eq. (41) into Eq. (42) and Eq. (43), we have

Γ̄^{z,⊤} T^z([f_x]^{-1}_Z(x)) = Γ̄̃^{z,⊤} T̃^z([f̃_x]^{-1}_Z(x)) + b̄_z, Γ̄^{s,⊤} T^s([f_x]^{-1}_S(x)) = Γ̄̃^{s,⊤} T̃^s([f̃_x]^{-1}_S(x)) + b̄_s. (44)

According to assumption (4), Γ̄^s and Γ̄^z have full column rank.
Therefore, the pseudo-inverses (Γ̄^{t,⊤} Γ̄^t)^{-1} Γ̄^{t,⊤} exist for t = s, z, and we have

T^z([f_x]^{-1}_Z(x)) = M_z T̃^z([f̃_x]^{-1}_Z(x)) + a_z, (45)
T^s([f_x]^{-1}_S(x)) = M_s T̃^s([f̃_x]^{-1}_S(x)) + a_s, (46)
T^s(f_y^{-1}(y)) = M_s T̃^s(f̃_y^{-1}(y)) + a_s, (47)

where M_t := (Γ̄^{t,⊤} Γ̄^t)^{-1} Γ̄^{t,⊤} Γ̄̃^t and a_t := (Γ̄^{t,⊤} Γ̄^t)^{-1} Γ̄^{t,⊤} b̄_t for t = s, z. It remains to prove that M_z and M_s are invertible. Denote x̄ = f_x^{-1}(x). Applying (Khemakhem, Kingma and Hyvärinen, 2020, Lemma 3), there exist k_s points x̄^1, ..., x̄^{k_s} and k_z points x̃^1, ..., x̃^{k_z} such that (T^s_i)'([f_x^{-1}]_{S_i}(x̄^1)), ..., (T^s_i)'([f_x^{-1}]_{S_i}(x̄^{k_s})) for each i ∈ [q_s], and (T^z_i)'([f_x^{-1}]_{Z_i}(x̃^1)), ..., (T^z_i)'([f_x^{-1}]_{Z_i}(x̃^{k_z})) for each i ∈ [q_z], are linearly independent. By differentiating Eq. (45) and Eq. (46) at each x̄^i, i ∈ [k_s], and x̃^i, i ∈ [k_z], respectively, we have

[J_{T^s}(x̄^1), ..., J_{T^s}(x̄^{k_s})] = M_s [J_{T̃^s ∘ f̃_x^{-1} ∘ f_x}(x̄^1), ..., J_{T̃^s ∘ f̃_x^{-1} ∘ f_x}(x̄^{k_s})], (48)
[J_{T^z}(x̃^1), ..., J_{T^z}(x̃^{k_z})] = M_z [J_{T̃^z ∘ f̃_x^{-1} ∘ f_x}(x̃^1), ..., J_{T̃^z ∘ f̃_x^{-1} ∘ f_x}(x̃^{k_z})]. (49)

The linear independence above implies that [J_{T^s}(x̄^1), ..., J_{T^s}(x̄^{k_s})] and [J_{T^z}(x̃^1), ..., J_{T^z}(x̃^{k_z})] are invertible, which implies the invertibility of M_s and M_z. The rest is to prove p_{f_y}(y|[f_x]^{-1}_S(x)) = p_{f̃_y}(y|[f̃_x]^{-1}_S(x)). This can be shown by applying Eq. (31) again. Specifically, according to Eq. (31), we have

∫_X p_{ε_x}(x - x̄) p_{f_y}(y|[f_x]^{-1}_S(x̄)) p_{T^s,Γ^s,T^z,Γ^z}(f_x^{-1}(x̄)|d^e) volJ_{f_x^{-1}}(x̄) dx̄
= ∫_X p_{ε_x}(x - x̄) p_{f̃_y}(y|[f̃_x]^{-1}_S(x̄)) p_{T̃^s,Γ̃^s,T̃^z,Γ̃^z}(f̃_x^{-1}(x̄)|d^e) volJ_{f̃_x^{-1}}(x̄) dx̄. (51)
Denote l_{T^s,Γ^s,T^z,Γ^z,f_y,f_x,y}(x) := p_{f_y}(y|[f_x]^{-1}_S(x)) p_{T^s,Γ^s,T^z,Γ^z}(f_x^{-1}(x)|d^e) volJ_{f_x^{-1}}(x). Then Eq. (51) reads as a convolution identity (l_{T^s,Γ^s,T^z,Γ^z,f_y,f_x,y} * p_{ε_x})(x) = (l_{T̃^s,Γ̃^s,T̃^z,Γ̃^z,f̃_y,f̃_x,y} * p_{ε_x})(x), and, as before,

F[l_{T^s,Γ^s,T^z,Γ^z,f_y,f_x,y}](ω) φ_{ε_x}(ω) = F[l_{T̃^s,Γ̃^s,T̃^z,Γ̃^z,f̃_y,f̃_x,y}](ω) φ_{ε_x}(ω) (53)
⟹ F[l_{T^s,Γ^s,T^z,Γ^z,f_y,f_x,y}](ω) = F[l_{T̃^s,Γ̃^s,T̃^z,Γ̃^z,f̃_y,f̃_x,y}](ω) (54)
⟹ l_{T^s,Γ^s,T^z,Γ^z,f_y,f_x,y}(x) = l_{T̃^s,Γ̃^s,T̃^z,Γ̃^z,f̃_y,f̃_x,y}(x) (55)
⟹ p_{f_y}(y|[f_x]^{-1}_S(x)) p_{T^s,Γ^s,T^z,Γ^z}(f_x^{-1}(x)|d^e) volJ_{f_x^{-1}}(x) = p_{f̃_y}(y|[f̃_x]^{-1}_S(x)) p_{T̃^s,Γ̃^s,T̃^z,Γ̃^z}(f̃_x^{-1}(x)|d^e) volJ_{f̃_x^{-1}}(x). (56)

Taking the log transformation on both sides of Eq. (56), we have

log p_{f_y}(y|[f_x]^{-1}_S(x)) + log p_{T^s,Γ^s,T^z,Γ^z}(f_x^{-1}(x)|d^e) + log volJ_{f_x^{-1}}(x) = log p_{f̃_y}(y|[f̃_x]^{-1}_S(x)) + log p_{T̃^s,Γ̃^s,T̃^z,Γ̃^z}(f̃_x^{-1}(x)|d^e) + log volJ_{f̃_x^{-1}}(x). (57)

Subtracting Eq. (57) with y_2 from Eq. (57) with y_1, we have

p_{f_y}(y_2|[f_x]^{-1}_S(x)) / p_{f_y}(y_1|[f_x]^{-1}_S(x)) = p_{f̃_y}(y_2|[f̃_x]^{-1}_S(x)) / p_{f̃_y}(y_1|[f̃_x]^{-1}_S(x)) (58)
⟹ ∫_Y p_{f_y}(y_2|[f_x]^{-1}_S(x)) / p_{f_y}(y_1|[f_x]^{-1}_S(x)) dy_2 = ∫_Y p_{f̃_y}(y_2|[f̃_x]^{-1}_S(x)) / p_{f̃_y}(y_1|[f̃_x]^{-1}_S(x)) dy_2 (59)
⟹ p_{f_y}(y_1|[f_x]^{-1}_S(x)) = p_{f̃_y}(y_1|[f̃_x]^{-1}_S(x)) (60)

for any y_1 ∈ Y. This completes the proof. Understanding assumption (4) in Theorems 7.9 and 7.6. Recall that we assume the confounder d_s in LaCIM to be the source variable for generating the data of the corresponding domain. Here we also use C to denote the space of d_s (since d_s := c); then we have the following theoretical conclusion: as long as the image set h(C) is not included in any set of Lebesgue measure 0, assumption (4) holds. This conclusion means that assumption (4) holds generically. Theorem 7.10. Denote h^{t=s,z}(d) := [Γ^t_{1,1}(d) - Γ^t_{1,1}(d^{e_1}), ..., Γ^t_{q_t,k_t}(d) - Γ^t_{q_t,k_t}(d^{e_1})] and h(C) := h^s(C) ⊕ h^z(C) ⊂ R^{q_s k_s} ⊕ R^{q_z k_z}; then assumption (4) holds if h(C) is not included in any zero-measure set of R^{q_s k_s} ⊕ R^{q_z k_z}.
Denote r_s := q_s k_s and r_z := q_z k_z. Proof. Without loss of generality, we assume that r_s ≤ r_z. Denote by Q the set of integers q such that there exist d^{e_2}, ..., d^{e_{q+1}} with rank([h^z(d^{e_2}), ..., h^z(d^{e_{q+1}})]) = min(q, r_z) and rank([h^s(d^{e_2}), ..., h^s(d^{e_{q+1}})]) = min(q, r_s). Denote u := max(Q), and suppose for contradiction that assumption (4) fails. We discuss two possible cases for u: • Case 1: u < r_s ≤ r_z. Then there exist d^{e_2}, ..., d^{e_{u+1}} such that h^z(d^{e_2}), ..., h^z(d^{e_{u+1}}) and h^s(d^{e_2}), ..., h^s(d^{e_{u+1}}) are linearly independent. Then for every d, we have h^z(d) ∈ L(h^z(d^{e_2}), ..., h^z(d^{e_{u+1}})) or h^s(d) ∈ L(h^s(d^{e_2}), ..., h^s(d^{e_{u+1}})); therefore h^z(d) ⊕ h^s(d) ∈ [L(h^z(d^{e_2}), ..., h^z(d^{e_{u+1}})) ⊕ R^{r_s}] ∪ [R^{r_z} ⊕ L(h^s(d^{e_2}), ..., h^s(d^{e_{u+1}}))], which has measure 0 in R^{r_z} ⊕ R^{r_s}. • Case 2: r_s ≤ u < r_z. Then there exist d^{e_2}, ..., d^{e_{u+1}} such that h^z(d^{e_2}), ..., h^z(d^{e_{u+1}}) are linearly independent and rank([h^s(d^{e_2}), ..., h^s(d^{e_{u+1}})]) = r_s. Then for every d, h^z(d) ∈ L(h^z(d^{e_2}), ..., h^z(d^{e_{u+1}})), which means h^z(d) ⊕ h^s(d) ∈ L(h^z(d^{e_2}), ..., h^z(d^{e_{u+1}})) ⊕ R^{r_s}, which has measure 0 in R^{r_z} ⊕ R^{r_s}. Both cases contradict the assumption that h(C) is not included in any zero-measure set of R^{r_z} ⊕ R^{r_s}. Lemma 7.11. Consider the case k_s ≥ 2, and suppose the assumptions in theorem 7.9 are satisfied. Further assume that: • the sufficient statistics T^s_{i,j} are twice differentiable for each i ∈ [q_s] and j ∈ [k_s]; • f_y is twice differentiable. Then M_s in theorem 7.9 is a block permutation matrix. Proof. Directly apply (Khemakhem, Kingma and Hyvärinen, 2020, Theorem 2) with f_x, A, b, T, x replaced by f_y, M_s, a_s, T^s, y. Lemma 7.12. Consider the case k_s = 1, and suppose the assumptions in theorem 7.9 are satisfied. Further assume that: • the sufficient statistics T^s_i are not monotonic for i ∈ [q_s]; • g is smooth.
Then M_s in theorem 7.9 is a block permutation matrix. Proof. Directly apply (Khemakhem, Kingma and Hyvärinen, 2020, Theorem 3) with f_x, A, b, T, x replaced by f_y, M_s, a_s, T^s, y. Proof of Theorem 7.6. According to theorem 7.9, there exist invertible matrices M_s and M_z such that

T(f_x^{-1}(x)) = A T̃(f̃_x^{-1}(x)) + b, T^s([f_x]^{-1}_S(x)) = M_s T̃^s([f̃_x]^{-1}_S(x)) + a_s, T^s(f_y^{-1}(y)) = M_s T̃^s(f̃_y^{-1}(y)) + a_s, (61)

where T = [T^{s,⊤}, T^{z,⊤}]^⊤ and A is the block-diagonal matrix with blocks M_s and M_z. By further assuming that the sufficient statistics T^s_{i,j} are twice differentiable for each i ∈ [q_s] and j ∈ [k_s] when k_s ≥ 2, and not monotonic when k_s = 1, and applying lemmas 7.11 and 7.12 respectively, we have that M_s is a block permutation matrix. By further assuming that the T^z_{i,j} are twice differentiable for each i ∈ [q_z] and j ∈ [k_z] when k_z ≥ 2, and not monotonic when k_z = 1, A is a block permutation matrix; therefore M_z is also a block permutation matrix. It remains to generalize to C := ∪_r C_r. Let Δ_{x,y} = [p_θ(x, y|c_1) - p_θ̃(x, y|c_1), ..., p_θ(x, y|c_m) - p_θ̃(x, y|c_m)]^⊤; then Eq. (62) can be written as A Δ_{x,y} = 0, where A := P_{d^{e_1}} ∈ R^{m×R}. According to the diversity condition, A and [[Γ^t(c_2) - Γ^t(c_1)]^⊤, ..., [Γ^t(c_m) - Γ^t(c_1)]^⊤]^⊤ have full column rank; therefore Δ_{x,y} = 0, i.e., p_θ(x, y|c_r) = p_θ̃(x, y|c_r) for each r ∈ [R]. The rest of the proof is the same as that of theorem 7.6.
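The diversity condition (assumption (4)), i.e. that the stacked differences [Γ^t(d^{e_2}) - Γ^t(d^{e_1}); ...; Γ^t(d^{e_m}) - Γ^t(d^{e_1})] have full column rank, can be checked with a small sketch. The rank routine below is a pure-Python Gaussian-elimination illustration with hypothetical inputs; in practice one would use a numerical rank such as `numpy.linalg.matrix_rank`:

```python
def rank(M, tol=1e-9):
    """Rank of a real matrix (list of rows) via row reduction on a copy."""
    M = [row[:] for row in M]
    rows, cols = len(M), len(M[0])
    r = 0
    for c in range(cols):
        # find a pivot row at or below position r
        piv = next((i for i in range(r, rows) if abs(M[i][c]) > tol), None)
        if piv is None:
            continue
        M[r], M[piv] = M[piv], M[r]
        for i in range(rows):
            if i != r and abs(M[i][c]) > tol:
                f = M[i][c] / M[r][c]
                M[i] = [a - f * b for a, b in zip(M[i], M[r])]
        r += 1
    return r

def satisfies_diversity(gammas):
    """gammas[e] is the flattened natural parameter Gamma(d^e) of
    environment e; checks that the differences to the first environment
    span the full parameter space (full column rank)."""
    base = gammas[0]
    diffs = [[g - b for g, b in zip(row, base)] for row in gammas[1:]]
    return rank(diffs) == len(base)
```

If all environments shift the natural parameters along a single direction, the condition fails, matching the intuition that m ≥ max(q_s k_s, q_z k_z) + 1 sufficiently diverse environments are needed.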

7.4. PROOF OF THEOREM 4.4

Proof of Theorem 4.4. Due to Eq. (62), it suffices to prove the conclusion for every c_r ∈ {c_r}_{r∈[R]}. Motivated by Barron and Sheu (1991, Theorem 2), a distribution p^e(s, z) defined on a bounded set can be approximated by a sequence of exponential-family distributions whose sufficient statistics are polynomial terms; such T^{t=s,z} are twice differentiable and hence satisfy assumption (2) in theorem 4.3 and assumption (1) in lemma 7.11. Besides, Lemma 4 in Barron and Sheu (1991) informs us that the KL divergence between p_{θ_0}(s, z|c_r) (θ_0 := (f_x, f_y, T^z, T^s, Γ^z_0, Γ^s_0)) and p_{θ_1}(s, z|c_r) (θ_1 := (f_x, f_y, T^z, T^s, Γ^z_1, Γ^s_1)), both belonging to exponential families with polynomial sufficient-statistics terms, can be bounded by the ℓ_2 norm of [(Γ^s_0(c_r) - Γ^s_1(c_r))^⊤, (Γ^z_0(c_r) - Γ^z_1(c_r))^⊤]^⊤. Therefore, ∀ε > 0,

I_n := |p(x ∈ A, y ∈ B|c_r) - p_n(x ∈ A, y_n ∈ B|c_r)| → 0, (63)

where p_n(x ∈ A, y_n ∈ B|c_r) = ∫_S ∫_Z p(x ∈ A|s, z) p(y_n ∈ B|s) p_n(s, z|c_r) ds dz, with

y_n(i) = exp((f_{y,i}(s) + ε_{y,i})/T_n) / Σ_j exp((f_{y,j}(s) + ε_{y,j})/T_n), i = 1, ..., k, (65)

for y ∈ R^k denoting the k-dimensional one-hot vector of the categorical variable and ε_{y,1}, ..., ε_{y,k} being i.i.d. Gumbel. According to (Maddison et al., 2016, Proposition 1), y_n(i) converges in distribution to y(i) with p(y(i) = 1) = exp(f_{y,i}(s)) / Σ_j exp(f_{y,j}(s)) as T_n → 0. As long as f_y is smooth, p(y_n|s) is continuous. We can decompose

I_n = |p(x ∈ A, y ∈ B|c_r) - ∫_{S×Z} p(x ∈ A|s, z) p(y_n ∈ B|s) p_n(s, z|c_r) ds dz| ≤ I_{n,1} + I_{n,2} + I_{n,3} + I_{n,4}, (66)

where I_{n,1} := |p(x ∈ A, y ∈ B|c_r) - p(x ∈ A, y_n ∈ B|c_r)| accounts for replacing y by y_n, and the remaining terms (handled below) account for restricting to a compact set M_s × M_z, for replacing p(s, z|c_r) by p_n(s, z|c_r) on that set, and for the residual mass outside it. For I_{n,1}: if y itself follows the additive model y = f_y(s) + ε_y, we can set y_n equal in distribution to y, so that I_{n,1} = 0; therefore we only consider the case where y denotes a categorical variable with the softmax distribution, i.e., Eq. (65).
For I_{n,2}: ∀c_r ∈ C := {c_1, ..., c_R} and ∀ε > 0, there exist compact sets M^{c_r}_s and M^{c_r}_z such that p(s, z ∉ M^{c_r}_s × M^{c_r}_z|c_r) ≤ ε. Denote M_s := ∪_{r=1}^{R} M^{c_r}_s and M_z := ∪_{r=1}^{R} M^{c_r}_z; then p(s, z ∉ M_s × M_z|c_r) ≤ ε for all c_r ∈ C, and hence I_{n,2} ≤ ∫_{(M_s × M_z)^c} 2 p(s, z|c_r) ds dz ≤ 2ε. Together with the convergence of y_n to y, this yields ∫_{S×Z} p(x ∈ A|s, z) (p(y ∈ B|s) - p(y_n ∈ B|s)) p(s, z|c_r) ds dz → 0 as n → ∞. For I_{n,3}: denote p̄(s, z|c_r) := p(s, z|c_r) 1(s, z ∈ M_s × M_z) / p(s, z ∈ M_s × M_z|c_r); then

I_{n,3} = |∫_{M_s × M_z} p(x ∈ A|s, z) p(y_n ∈ B|s) (p(s, z|c_r) - p_n(s, z|c_r)) ds dz|
≤ ∫_{M_s × M_z} p(x ∈ A|s, z) p(y_n ∈ B|s) p(s, z|c_r) |1/p(s, z ∈ M_s × M_z|c_r) - 1| ds dz (=: I_{n,3,1})
+ |∫_{M_s × M_z} p(x ∈ A|s, z) p(y_n ∈ B|s) (p̄(s, z|c_r) - p_n(s, z|c_r)) ds dz| (=: I_{n,3,2}). (67)

We have I_{n,3,1} ≤ ε/(1 - ε). According to (Barron and Sheu, 1991, Theorem 2), there exists a sequence p_n(s, z|c_r) defined on the compact support M_s × M_z such that p_n(s, z|c_r) converges in distribution to p̄(s, z|c_r) for every c_r ∈ C. Applying the Heine-Borel theorem again, ∀ε, ∃N such that ∀n ≥ N, |p̄(s, z|c_r) - p_n(s, z|c_r)| ≤ ε, which, combined with the fact that p(x, y|s, z) is continuous with respect to (s, z), implies I_{n,3,2} → 0 as n → ∞. For I_{n,4}:

I_{n,4} = ∫_{(M_s × M_z)^c} p(x ∈ A|s, z) p(y_n ∈ B|s) p(s, z|c_r) ds dz ≤ ∫_{(M_s × M_z)^c} p(s, z|c_r) ds dz ≤ ε,

where the first equality uses the fact that p_n(s, z|c_r) is supported on M_s × M_z. Then

∫_{S×Z} p(x ∈ A|s, z) p(y_n ∈ B|s) (p(s, z|c_r) - p_n(s, z|c_r)) ds dz → 0, as n → ∞. (68)

The proof is completed.
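The tempered softmax y_n in Eq. (65) is the Gumbel-softmax (Concrete) relaxation of Maddison et al. (2016). The sketch below is a minimal assumed implementation, not the paper's code; it illustrates that for a small temperature T_n the relaxed sample is nearly one-hot, with its argmax following the softmax distribution of the logits (the Gumbel-max trick):

```python
import random
from math import exp, log

def gumbel_softmax(logits, temperature, rng):
    """Sample y_n(i) = softmax((logits_i + eps_i) / T) with i.i.d. Gumbel eps."""
    eps = [-log(-log(rng.random())) for _ in logits]  # Gumbel(0, 1) noise
    scaled = [(l + e) / temperature for l, e in zip(logits, eps)]
    m = max(scaled)                      # subtract max for numerical stability
    exps = [exp(s - m) for s in scaled]
    z = sum(exps)
    return [e / z for e in exps]

rng = random.Random(0)
sample = gumbel_softmax([2.0, 0.5, -1.0], 0.01, rng)  # near one-hot at T=0.01
```

As T_n grows, the samples spread over the interior of the simplex; as T_n → 0 they concentrate on the vertices, matching the convergence in distribution used in the proof.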

7.5. REPARAMETERIZATION FOR LACIM-d

We provide an alternative training method that avoids parameterizing the prior p(s, z|d^e), so as to increase the diversity of the generative models across environments. Specifically, motivated by Hyvärinen and Pajunen (1999), who show that any distribution can be transformed into an isotropic Gaussian with density denoted by p_Gau, we have for any e ∈ E_train that

p^e(x, y) = ∫_{S×Z} p_{f_x}(x|s, z) p_{f_y}(y|s) p(s, z|d^e) ds dz = ∫_{S×Z} p(x|(ρ^e_s)^{-1}(s'), (ρ^e_z)^{-1}(z')) p(y|(ρ^e_s)^{-1}(s')) p_Gau(s', z') ds' dz',

with s', z' := ρ^e_s(s), ρ^e_z(z) ∼ N(0, I). We can then rewrite the ELBO of LaCIM-d for environment e as:

L^e_{φ,ψ,ρ^e} = E_{p^e(x,y)}[-log q^e_ψ(y|x)] + E_{p^e(x,y)}[-E_{q^e_ψ(s,z|x) q_ψ(y|(ρ^e_s)^{-1}(s))/q^e_ψ(y|x)} log (p_φ(x|(ρ^e_s)^{-1}(s), (ρ^e_z)^{-1}(z)) p_Gau(s, z) / q^e_ψ(s, z|x))]. (72)
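The fact underlying this reparameterization, that any continuous distribution can be pushed to a Gaussian, can be sketched in 1-D by composing the source CDF with the inverse Gaussian CDF, ρ(s) = Φ^{-1}(F(s)). The example below is an assumed illustration with an Exponential(1) source, standing in for the learned, environment-specific ρ^e:

```python
import random
from math import exp
from statistics import NormalDist

def to_standard_gaussian(x, cdf):
    """rho(s) = Phi^{-1}(F(s)): map a sample of a continuous distribution
    with CDF F to a standard-Gaussian sample."""
    return NormalDist().inv_cdf(cdf(x))

# Example: Exponential(1) samples, with CDF F(s) = 1 - exp(-s)
rng = random.Random(0)
samples = [rng.expovariate(1.0) for _ in range(20000)]
gauss = [to_standard_gaussian(s, lambda t: 1.0 - exp(-t)) for s in samples]
```

The transformed samples follow N(0, 1) exactly; the multivariate analogue (componentwise, after a suitable triangular decomposition) is the construction of Hyvärinen and Pajunen (1999) that motivates modeling s', z' as isotropic Gaussian.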

7.6. IDENTIFIABILITY

Earlier works that identify latent confounders rely on strong assumptions about the causal structure, such as a linear model from latent to observed variables, ICA in which the latent components are independent (Silva et al., 2006), or noise-free models (Shimizu et al., 2009; Davies, 2004). Hoyer et al. (2008) and Janzing, Peters, Mooij and Schölkopf (2012) extend these to the additive noise model (ANM) and other causal-discovery assumptions. Although Lee et al. (2019) relaxed the constraints on the causal structure, it requires the latent noise to be of small strength, which does not match many realistic scenarios, such as the structural MRI of Alzheimer's Disease considered in our experiment. Works also based on independent component analysis (ICA), i.e., assuming the latent variables to be (conditionally) independent, include Davies (2004) and Eriksson and Koivunen (2003); recently, a series of works has extended the above results to deep nonlinear ICA (Hyvarinen and Morioka, 2016; Hyvärinen et al., 2019; Khemakhem, Kingma and Hyvärinen, 2020; Khemakhem, Monti, Kingma and Hyvärinen, 2020; Teshima et al., 2020). However, these works require the value of the confounder of the latent variables to be fixed, which cannot explain the spurious correlation within a single dataset. In contrast, our result can incorporate these scenarios by assuming that each sample has a specific value of the confounder. Other works assume discrete distributions for the latent variables, such as Janzing, Sgouritsa, Stegle, Peters and Schölkopf (2012), Kocaoglu et al. (2018) and Sgouritsa et al. (2013). However, no existing work in the literature can disentangle the prediction-causative features from the others so as to avoid spurious correlation for OOD generalization. Khemakhem, Kingma and Hyvärinen (2020) and Ilse et al. (2020; 2019) assumed Y → S(X) as the causal direction. This difference from ours can mainly be attributed to the generating process of Y.
Different understandings lead to different causal graphs. The example of digit hand-writing in Peters et al. (2017) provides a good illustration. If the writer is provided with a label (say, "2") first, before writing the digit (denoted as X), then the direction should be Y → X. If, instead, the writing is driven by an incentive (denoted as S) of which digit to write, and the writer records the label Y and the digit X concurrently, then the structure should be X ← S → Y. Under Y → S, Y is treated as the source variable that generates the latent components and is observed before X. In contrast, we define Y as the ground-truth label given by humans. Taking image classification as an example, it is humans who assign the categories of all things such as animals; in this case, the labels given by humans can be assumed to be ground-truth labels. This assumption is supported by work in psychology Biederman (1987) showing that, thanks to their powerful perceptual learning ability, humans can factorize an image X into many components. These components, denoted as S, can be accurately detected by humans, so we can approximately assume that it is S that generates the label Y. Consider the task of early prediction of Alzheimer's Disease: the disease label is given based on pathological analysis and is observed after the MRI X. Such a labelling outcome can be regarded as ground truth, as it is defined by medical science. The corresponding pathological features, as the evidence for labelling, can also be thought of as generators of X. In these cases, it is more appropriate to treat Y as the outcome than as the cause. For example, Peters et al. (2016) ; Kuang et al. (2018) assumed X_S → Y. As an adaptation to sensory-level data such as images, we assume S → Y with S being latent variables that model high-level explanatory factors, which coincides with the existing literature Teshima et al. (2020) .
Another difference lies in the definition of Y. Invariant Risk Minimization Arjovsky et al. (2019) (we will give a detailed comparison later) assumes X → S → Y by defining Y as the label with noise, where S denotes the hidden components extracted by the observer.

7.7.2. COMPARISONS WITH DATA AUGMENTATION & ARCHITECTURE DESIGN

The goal of data augmentation Shorten and Khoshgoftaar (2019) is to increase the variety of the data distribution, e.g., via geometric transformations Kang et al. (2017) ; Taylor and Nitschke (2017) , flipping, style transfer Gatys et al. (2015) , or adversarial examples Madry et al. (2017) . Alternatively, one can build into the model modules that improve robustness to certain types of variation, such as Worrall et al. (2017) ; Marcos et al. (2016) . However, these techniques only take effect because the corresponding variations are included in the training data for the neural network to memorize Zhang et al. (2016) ; besides, the improvement is limited to the specific types of variation considered. As analyzed in Xie et al. (2020) ; Krueger et al. (2020) , data augmentation trained with empirical risk minimization, or robust optimization Ben-Tal et al. (2009) such as adversarial training Madry et al. (2017) ; Sagawa et al. (2019) , can only achieve robustness on interpolations (the convex hull) of the training environments rather than extrapolations of them.

7.7.3. COMPARISONS WITH EXISTING WORKS IN DOMAIN ADAPTATION

The main differences lie in the problem setting: (i) domain adaptation (DA) can access the input data of the target domain while ours cannot; and (ii) our method needs multiple training domains while DA needs only one source domain. Methodologically, our LaCIM shares insights with, but differs from, DA. Specifically, both assume some type of invariance relating the training domains to the target domain. For DA, one stream assumes a shared conditional distribution between the source and target domains, e.g., covariate shift Huang et al. (2007) ; Ben-David et al. (2007) ; Johansson et al. (2019) ; Sugiyama et al. (2008) , in which P(Y|X) is assumed to be the same across domains, or concept shift Zhang et al. (2013) , in which P(X|Y) is assumed to be invariant. Such invariance may also be stated in terms of a representation, such as Φ(X) in Zhao et al. (2019) and P(Y|Φ(X)) in Pan et al. (2010) ; Ganin et al. (2016) ; Magliacane et al. (2018) . However, these assumptions are only distribution-level rather than capturing the underlying causation, which takes the data-generating process into account. Taking image classification again as an example, our method first proposes a causal graph in which latent factors are introduced as the explanatory/causal factors of the observed variables. This is supported by the generative-model framework Khemakhem, Kingma and Hyvärinen (2020) ; Khemakhem, Monti, Kingma and Hyvärinen (2020) ; Kingma and Welling (2014) ; Suter et al. (2019) , which has a natural connection with the causal graph Schölkopf (2019) : each edge in the causal graph reflects both the causal effect and the generating process. Perhaps the works most similar to ours are Romeijn and Williamson (2018) and Teshima et al. (2020) , which also require multiple training domains and access to a few samples from the target domain.
Both works assume a causal graph similar to ours but, unlike our LaCIM, they do not separate the latent factors, and hence cannot explain the spurious correlation learned by supervised learning Ilse et al. (2020) . Besides, the multiple training datasets in Romeijn and Williamson (2018) refer to intervened data, which may be hard to obtain in some applications. We have verified in our experiments that explicitly disentangling the latent variables into two parts results in better OOD prediction power than mixing them together.

7.7.4. COMPARISONS WITH DOMAIN GENERALIZATION

For domain generalization (DG), similarly to the invariance assumption in DA, a series of works proposed to align a representation Φ(X) assumed to be invariant across domains Li et al. (2017; 2018) ; Muandet et al. (2013) . As discussed above, these methods do not delve into the underlying causal structure and preclude the variations of unseen domains. Recently, a series of works leverage causal invariance to enable OOD generalization on unseen domains, such as Ilse et al. (2019) , which learns a domain-invariant representation. Notably, Invariant Causal Prediction Peters et al. (2016) formulates the assumption within the definition of a Structural Causal Model and assumes that Y = X_S^T β_S + ε_Y, where ε_Y follows a Gaussian distribution and S denotes a subset of the covariates of X. Rojas-Carulla et al. (2018); Bühlmann (2018) relax this assumption by assuming invariance of f_y and of the noise distribution ε_y in Y ← f_y(X_S, ε_y), which induces an invariant P(Y|X_S). A similar assumption is adopted in Kuang et al. (2018) . However, these works causally relate the output to the observed input, which may not hold in many real applications where the observed data is sensory-level, such as audio waves and pixels. It has been discussed in Bengio et al. (2013) ; Bengio (2017) that the causal factors should be high-level abstractions/concepts. Heinze-Deml and Meinshausen (2017) consider a style-transfer setting in which each image is a linear combination of a shape-related variable and a context-related variable, which respectively correspond to S and Z in our LaCIM, where a nonlinear mechanism (rather than the linear combination in Heinze-Deml and Meinshausen (2017)) is allowed. Besides, during testing, our method can generalize to OOD samples with interventions such as adversarial noise and contextual intervention. The most notable recent work is Invariant Risk Minimization Arjovsky et al.
(2019), which will be discussed in detail in the subsequent section. Invariant Risk Minimization (IRM) Arjovsky et al. (2019) assumes the existence of an invariant representation Φ(X) that induces the optimal classifier for all domains, i.e., E[Y|Pa(Y)] is domain-independent in the SCM formulation. Similarly to our LaCIM, Pa(Y) can refer to latent variables. Besides, to identify the invariance and the optimal classifier, the training environments need to be diverse enough; as aforementioned, this assumption is almost necessary to differentiate the invariant mechanism from the variant ones. To learn such an invariance, a regularization function is proposed. The difference between our LaCIM and IRM lies in two aspects: the direction of the causal relation and the methodology. For the direction, as aforementioned in section 7.7.1, IRM assumes X → S rather than the S, Z → X of our LaCIM. This is because IRM defines Y as the label with noise, while we define Y as the ground-truth label, which should hence be generated by the ground-truth hidden components generating S. This inconsistency is reflected in the CMNIST experiment, in which the number is a causal factor of the label Y rather than merely an invariant correlation. In terms of methodology, the theoretical claim of IRM only holds in the linear case; in contrast, in our LaCIM f_x, f_y are allowed to be nonlinear. Some other works share a similar spirit with, or build on, IRM, such as Risk Extrapolation (REx) Krueger et al. (2020).

Implementation Details. We parameterize p_θ(s, z|d), q_φ(s, z|x, y, d), p_θ(x|s, z) and p_θ(y|s) as 3-layer MLPs with the LeakyReLU activation function. Adam with learning rate 5 × 10^-4 is used for optimization. We set the batch size to 512 and run 2,000 iterations in each trial.

Visualization. As shown in the visualization of S in Fig. 7.8, our LaCIM can identify the causal factor S.
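As a concrete illustration of this parameterization, the following minimal numpy sketch builds a 3-layer MLP with LeakyReLU activations of the kind used for p_θ(y|s); all layer sizes, the slope, and the random weights are hypothetical placeholders rather than the trained model:

```python
import numpy as np

def leaky_relu(x, slope=0.01):
    # LeakyReLU activation used between the hidden layers
    return np.where(x >= 0, x, slope * x)

def mlp3(x, weights, biases, slope=0.01):
    """A 3-layer MLP: two hidden LeakyReLU layers plus a linear output layer."""
    h = x
    for W, b in zip(weights[:-1], biases[:-1]):
        h = leaky_relu(h @ W + b, slope)
    return h @ weights[-1] + biases[-1]

rng = np.random.default_rng(0)
dims = [2, 16, 16, 2]  # hypothetical sizes: s in R^2 -> logits in R^2
Ws = [rng.normal(scale=0.1, size=(a, b)) for a, b in zip(dims[:-1], dims[1:])]
bs = [np.zeros(b) for b in dims[1:]]

s = rng.normal(size=(5, 2))          # a batch of 5 latent samples
logits = mlp3(s, Ws, bs, slope=0.5)  # slope 0.5 as in the toy data generation
print(logits.shape)  # (5, 2)
```

In practice each of the four distributions gets its own such network (outputting means and log-variances for the Gaussian ones), but the layer pattern is the same.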
We first sample initial points from each posterior distribution q_e(s|x) and then optimize for 50 iterations, using Adam with learning rate 0.002 and weight decay 0.0002. Fig. 7.9 shows the optimization effect of one run on CMNIST. As shown, the test accuracy keeps growing over the iterations; to save time, we chose to optimize for 50 iterations. Figure 5: The optimization effect on CMNIST, starting from initial points sampled from the inference model q of each branch. As shown, the test accuracy increases over the iterations.
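The test-time optimization above can be sketched as follows; the linear-Gaussian decoder, its random weights, and plain gradient ascent (in place of Adam) are illustrative assumptions standing in for the learned p_φ(x|s, z):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(8, 4))   # hypothetical linear decoder standing in for p(x|s,z)
v_true = rng.normal(size=4)   # ground-truth latent, (s, z) concatenated
x = A @ v_true                # the observed sample

def log_px(v):
    # Gaussian log-likelihood up to a constant: -0.5 * ||x - A v||^2
    return -0.5 * np.sum((x - A @ v) ** 2)

# Start from an initial point (in LaCIM this is sampled from the posterior q(s|x))
v = rng.normal(size=4)
lr = 0.02
losses = [-log_px(v)]
for _ in range(50):                # 50 iterations, mirroring the paper's setting
    grad = A.T @ (x - A @ v)       # gradient of log p(x|v) with respect to v
    v = v + lr * grad              # plain gradient ascent (the paper uses Adam)
    losses.append(-log_px(v))

print(losses[-1] < losses[0])  # reconstruction error decreases -> True
```

The prediction is then read off from the optimized s* via p_φ(y|s*); with a learned nonlinear decoder the objective is non-convex, which is why several initial points are sampled from the posterior.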

7.10. IMPLEMENTATIONS FOR BASELINE

For CE X → Y and CE (X, d_s) → Y, both are composed of two parts: (i) a feature extractor, followed by (ii) a classifier. The network structure of the feature extractor for CE X → Y is the same as that of our encoder, while the extracted feature for CE (X, d) → Y is the concatenation of the features encoded from X → S, Z via a network with the same structure as our encoder and a network with the same structure as the prior network of LaCIM-d. The classifier structures of both methods are the same as that of our p_φ(y|s). IRM and SDA adopt the same structure as CE X → Y. DANN adopts the same structure as CE X → Y plus an additional domain classifier, which has the same structure as p_φ(y|s). sVAE adopts the same structure as LaCIM-d_s except that p_φ(y|s) is replaced by p_φ(y|z, s). MMD-AAE adopts the same encoder, decoder, and classifier structures as LaCIM-d, with an additional 2-layer MLP with channels 256-256-dim_z used to extract the latent z. The detailed numbers of parameters and channel sizes on each dataset for each method are summarized in Tab. 13, 14.

Implementation details

The network structure of the inference model is composed of two parts: the first part is shared among all environments, and the second part has multiple branches, one per environment. The first-part encoder consists of four blocks, each a sequence of Convolutional layer (Conv), Batch Normalization (BN), ReLU and max-pooling with stride 2; the output numbers of feature maps are 32, 64, 128, 256, respectively. The second-part network, which outputs the mean and log-variance of S, Z, is Conv-BN-ReLU(256) → AdaptivePool(1) → FC(256, 256) → ReLU → FC(256, q_{t=s,z}), where FC stands for a fully-connected layer. The structure of ρ_{t=s,z} in Eq. (72) is FC(q_t, 256) → ReLU → FC(256, q_t). The generative model p_φ(x|s, z) is a sequence of three modules: (i) upsampling with stride 2; (ii) four blocks of Transposed Convolution (TConv), BN and ReLU with respective output dimensions 128, 64, 32, 16; (iii) Conv-BN-ReLU-Sigmoid with 3 output channels, followed by a cropping step to make the image the same size as the input, i.e., 3 × 28 × 28. The generative model p_φ(y|s) is composed of FC(512) → BN → ReLU → FC(256) → BN → ReLU → FC(|Y|). We set q_{t=s,z} to 32. We use SGD as the optimizer with learning rate 0.5 and weight decay 1e-5, and set the batch size to 256. The total number of training epochs is 80. We first explain why we do not flip y with probability 25% in the manuscript, and then provide a further exploration of our method in the setting with flipped y. Invariant Causation v.s. Invariant Correlation by Flipping y in Arjovsky et al. (2019). In the IRM setting, y is further flipped with probability 25% to obtain the final label; this step is omitted in ours. The difference lies in the definition of invariance: our LaCIM defines invariance as the causal relation between S and the label Y, while the invariance in IRM can be a mere correlation.
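As a quick sanity check of the encoder shapes described above, the following sketch traces a 3 × 28 × 28 input through the four Conv-BN-ReLU-maxpool blocks, assuming 'same'-padded convolutions so that only the stride-2 pooling changes the spatial size (an assumption, since the kernel sizes are not stated here):

```python
def trace_encoder(spatial=28, channels=(32, 64, 128, 256)):
    """Track (channels, spatial size) after each Conv-BN-ReLU + maxpool(stride 2) block."""
    shapes = []
    h = spatial
    for c in channels:
        h = h // 2          # 'same' conv keeps the size; maxpool with stride 2 halves it
        shapes.append((c, h))
    return shapes

shapes = trace_encoder()
print(shapes)  # [(32, 14), (64, 7), (128, 3), (256, 1)]
```

The final 256 × 1 × 1 feature map is then consumed by the second-part head that produces the mean and log-variance of S, Z.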
As illustrated in the Handwriting Sample Form in Fig. 7.11 of Grother (1995), the generating direction should be Y → X. If we denote the variable obtained by flipping Y as Ỹ (i.e., the final label in IRM), then the causal graph should be X ← Y → Ỹ. In this case, Ỹ is correlated with, rather than causally related to, the digit X. For our LaCIM, we define the label as the interpretable human label (which approximates y for any image x), represented by Y in our experiments. The reasons why we do not define Y as the ground-truth label are that (i) the prediction is based only on the components extracted from the image, which may be determined not only by the ground-truth label; and (ii) learning the human label is interpretable and relevant to humans. For example, if a writer is given the digit "2" but mistakenly writes it as "4", it is more interpretable to predict the digit as "4" rather than "2". For a digit whose label is ambiguous from the image alone, even a mistaken prediction is interpretable given only the information in the digit. Returning to the IRM setting, the label is flipped without reference to the semantic shape of the digit; the flipping may therefore happen to noiseless digits rather than noisy and ambiguous ones, making the shape of the number less semantically related to the label.

Experiment with IRM setting

We further conduct the experiment in the IRM setting, with the final label y defined by flipping the original label with probability 25%, and further coloring a proportion p_e of the digits with the corresponding color-label mapping. If we assume the original ground-truth label to be the effect of the digit number S, then the anti-causal relation between Z and Y can make the identifiability of S difficult in this flipping scenario. Note that the causal effect between S and Y is invariant across domains; we therefore regularize the branch inferring S to be shared among the inference models of the multiple environments. Besides, we regularize the causal effect between S and Y to be shared among different environments via a pairwise regularization. The combined loss is formulated as: L̃_{ψ,φ} = L_{ψ,φ} + (Γ / (2m²)) Σ_{i=1}^m Σ_{j=1}^m || E_{(x,y)∼p^{e_i}(x,y)}[y|x] − E_{(x,y)∼p^{e_j}(x,y)}[y|x] ||₂², with q^e_ψ(s, z|x) in Eq. (72) factorized as q_{ψ^e_z}(z) q_{ψ_s}(s) and ρ_s shared among the m environments; we then have p(y|x) = ∫_S q_{ψ_s}(s|x) p_φ(y|ρ_s(s)) ds for any x. The appended loss coincides with the recent study Risk Extrapolation (REx) Krueger et al. (2020), with the difference of separating the y-causative factors S from the others. We name this training method LaCIM-REx. For implementation details, in addition to the shared encoder for S, we set the learning rate to 0.1, the weight decay to 0.0002, and the batch size to 256. We consider two settings: setting#1 with m = 2 and p_{e1} = 0.9, p_{e2} = 0.8; and setting#2 with m = 4 and p_{e1} = 0.9, p_{e2} = 0.8, p_{e3} = 0.7, p_{e4} = 0.6. We only report the number for IRM, since cross entropy performs poorly in both settings. As shown, our model performs comparably with LaCIM-d_s and better than IRM Arjovsky et al. (2019) due to the separation of S and Z. The test domain is not included in the convex hull of {d_{e_i}}_{i=1}^{14}. More Visualization Results. Fig. 7 shows more visualization results. Results on Intervened Data.
We test our model and the baselines on intervened data, in which each image is generated by an intervention on Z, i.e., setting Z to a specific value. This intervention breaks the correlation between S and Z, so the resulting distribution can be regarded as a specific type of OOD. Specifically, we replace the scene of an image with the scene from another image, as shown in Fig. 8. We generate 120 images, including 30 images of each type: cat on grass, dog on grass, cat on snow, and dog on snow. We evaluate LaCIM-d, CE X → Y, IRM, DANN, NCBB, MMD-AAE, and DIVA on this intervened dataset. As shown in Tab. 9, our LaCIM-d performs the best among all methods, which validates the robustness of our LaCIM. Denotation of Attributes d_s. The C ∈ R^9 includes personal attributes (e.g., age Guerreiro and Bras (2015) , gender Vina and Lloret (2010) and education years Mortimer (1997) ) that play as potential … Implementation Details. For LaCIM-d_s, we parameterize the inference model q_ψ(s, z|x, d_s), p_φ(s, z|d_s), p_φ(x|z, s) and p_φ(y|s), with S, Z ∈ R^64. For q_ψ(s, z|x, d_s), we concatenate the outputs of the feature extractors for X and d_s; the feature extractor for x is composed of four Convolution-Batch… The d_s variable in training and test. The selected attributes include Education Years, Age, Gender (0 denotes male and 1 denotes female), AV45, amyloid β and TAU. We split the data into m = 2 training environments and a test environment according to different values of d_s. Tab. 7.13 describes the data distribution in terms of the number of samples and the values of d_s (Age and TAU).
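Returning to the LaCIM-REx objective above, its pairwise penalty over environments can be computed as in this sketch, where the per-environment averaged predictions E[y|x] are hypothetical arrays rather than real model outputs:

```python
import numpy as np

def pairwise_penalty(env_preds, gamma=1.0):
    """Gamma/(2 m^2) * sum_{i,j} ||E_i[y|x] - E_j[y|x]||_2^2 over environment pairs."""
    m = len(env_preds)
    total = 0.0
    for pi in env_preds:
        for pj in env_preds:
            total += np.sum((pi - pj) ** 2)
    return gamma / (2 * m**2) * total

# Hypothetical averaged predictions E[y|x] from m = 3 environments
preds = [np.array([0.9, 0.1]), np.array([0.8, 0.2]), np.array([0.9, 0.1])]
print(pairwise_penalty(preds))           # small but non-zero: environments disagree
print(pairwise_penalty([preds[0]] * 3))  # 0.0: identical predictions incur no penalty
```

The penalty vanishes exactly when all environments induce the same predictive mean, which is the invariance the shared S-branch is meant to enforce.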

7.13.1. EXPERIMENTS WITH COMPLETE OBSERVABLE SOURCE VARIABLE

In image-based diagnosis, personal attributes, genes and biomarkers are often available. We therefore consider the setting in which d_s is fully observed. In this case, the value of d_s varies person by person, so the number of environments m equals the number of samples, and the dataset becomes {x_i, y_i, d_{s,i}}_{i=1}^n. The expected risk becomes:

L_{ψ,φ} = E_{p(x,y|d_s)} [ −log q_ψ(y|x, d_s) − E_{q_ψ(s,z|x,d_s)} ( p_φ(y|s) / q_ψ(y|x, d_s) ) log ( p_φ(x|s, z) p_φ(s, z|d_s) / q_ψ(s, z|x, d_s) ) ].    (73)

And the corresponding empirical risk is:

L̃_{ψ,φ} = (1/n) Σ_{i=1}^n [ −log q_ψ(y_i|x_i, d_{s,i}) − E_{q_ψ(s,z|x_i,d_{s,i})} ( p_φ(y_i|s) / q_ψ(y_i|x_i, d_{s,i}) ) log ( p_φ(x_i|s, z) p_φ(s, z|d_{s,i}) / q_ψ(s, z|x_i, d_{s,i}) ) ].    (74)

The d_s here is re-defined as the 9-dimensional vector that includes all the attributes, genes and biomarkers mentioned above. We re-split the data into 80% train and 20% test according to different average values of a specific variable in the vector d_s.
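A Monte-Carlo estimate of one summand of the empirical risk in Eq. (74) can be sketched as follows; the scalar latent (s and z collapsed into one variable), the Gaussian forms, and every parameter value are toy assumptions, not the learned networks:

```python
import numpy as np

rng = np.random.default_rng(0)

def log_gauss(v, mu, sigma):
    # log density of N(mu, sigma^2)
    return -0.5 * np.log(2 * np.pi * sigma**2) - (v - mu) ** 2 / (2 * sigma**2)

# Toy 1-D model: all conditionals are Gaussians with hypothetical parameters.
x, y, ds = 1.0, 0.5, 0.2
K = 2000
sz = rng.normal(loc=0.3, scale=0.8, size=K)  # samples from q(s,z|x,ds)

def loss_74_single():
    """Monte-Carlo estimate of one sample's term in the empirical risk of Eq. (74)."""
    log_q_y = log_gauss(y, 0.4, 1.0)              # log q_psi(y|x, ds), assumed Gaussian
    terms = []
    for v in sz:
        w = np.exp(log_gauss(y, v, 1.0) - log_q_y)  # weight p(y|s) / q(y|x, ds)
        elbo = (log_gauss(x, v, 1.0)                # log p(x|s, z)
                + log_gauss(v, ds, 1.0)             # log p(s, z|ds)
                - log_gauss(v, 0.3, 0.8))           # - log q(s, z|x, ds)
        terms.append(w * elbo)
    return -log_q_y - np.mean(terms)

print(np.isfinite(loss_74_single()))  # True: the estimator is well-defined
```

In the real model the densities are amortized networks and the expectation is estimated with one or a few reparameterized samples per data point, but the weighting structure is the same.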



T^t(t) := [T^t_1(t_1), ..., T^t_{q_t}(t_{q_t})] ∈ R^{k_t × q_t}, with T^t_i(t_i) := [T^t_{i,1}(t_i), ..., T^t_{i,k_t}(t_i)], ∀i ∈ [q_t]. The d_Pok denotes the Prokhorov distance, and lim_{n→∞} d_Pok(μ_n, μ) → 0 ⟺ μ_n →_d μ. We also conduct this experiment with flipped y in supplementary 7.11. On NICO, we implement ConvNet with Batch Balancing as a specific benchmark in He et al. (2019). The results are 60 ± 1 for m = 8 and 62.33 ± 3.06 for m = 14.



Figure 1: The DAG for LaCIM. Variables marked in white (gray) represent unobserved (observed) variables. Each arrow represents the causal effect of the variable at its tail on the variable at its head. C denotes the confounder of S, Z, which are the causal factors of X, Y. D denotes the domain index, which varies across domains and characterizes the distribution of C.

3; (II) OOD challenges: object classification with sample selection bias (Non-I.I.D. Image dataset with Contexts (NICO)); hand-writing recognition with confounding bias (Colored MNIST (CMNIST)); prediction of Alzheimer's Disease (Alzheimer's Disease Neuroimaging Initiative (ADNI), www.loni.ucla.edu/ADNI); (III) robustness in detecting images with small perturbations (FaceForensics++).

Figure 2: Visualization of Z. From left to right: the posteriors estimated by pool-LaCIM, LaCIM-d_s, LaCIM-d, and the ground truth. As shown, LaCIM-d_s (Fig. (b)) and LaCIM-d (Fig. (c)) can identify the ground-truth distribution of Z (i.e., p_φ(z|d_s)) up to permutation and point-wise transformation, which validates the claim of theorem 4.3.

, (iii) Maximum Mean Discrepancy with Adversarial Auto-Encoder (MMD-AAE) for domain generalization Li et al. (2018), (iv) Domain Invariant Variational Autoencoders (DIVA) Ilse et al. (2019), (v) Selecting Data Augmentation (SDA) Ilse et al. (2020), (vi) Invariant Risk Minimization (IRM) Arjovsky et al. (2019), (vii) CE (X, d_s) → Y, (viii) a VAE with causal graph C → V → {X, Y} with V mixing S, Z, which we call sVAE for simplicity. We only implement SDA on CMNIST, since the intervened-data generation of SDA requires explicitly extracting S, Z, which is intractable on ADNI and NICO. For fair comparison, we keep the model capacity (number of parameters) at the same level.

Figure 3: Visualization via gradient Simonyan et al. (2013). From the left to right: original image, CE X → Y , CE (X, d s ) → Y and LaCIM-d s .

Proof of Theorem 4.3. We consider the general case when C := ∪_{r=1}^R C_r, in which each C_r can be simplified as a representative point c_r. For environment d^e, let P_{d^e} = [P(C = c_1|d^e), ..., P(C = c_R|d^e)] be the vector of probability masses of C in environment d^e, and let E_train have m environments with indexes d^{e_1}, ..., d^{e_m}. The latent factors (S, Z) follow the exponential family distribution p(s, z|c) = p_{T^z, Γ^z(c)}(z) p_{T^s, Γ^s(c)}(s). Suppose that θ = {f_x, f_y, T^s, T^z} and θ̃ = {f̃_x, f̃_y, T̃^s, T̃^z} share the same observational distribution for each environment, i.e., p_θ(x, y|d^e) = p_θ̃(x, y|d^e); then we have Σ_{r=1}^R p_θ(x, y|c_r) P(C = c_r|d^e) = Σ_{r=1}^R p_θ̃(x, y|c_r) P̃(C = c_r|d^e).

there exists an open set of Γ(c_r) such that D_KL(p(s, z|c_r), p_θ(s, z|c_r)) < ε. Such an open set has non-zero Lebesgue measure and therefore satisfies assumption (4) in theorem 4.3, according to the result in theorem 7.10. It is left to prove that for any p defined by a LaCIM following Def. 4.1, there is a sequence {p_n}_n ⊂ P_exp such that d_Pok(p, p_n) → 0, which is equivalent to p_n →_d p. For any A, B, we aim to prove that

∫_{S×Z} p(x ∈ A|s, z) p(y ∈ B|s) p(s, z|c_r) ds dz − ∫_{S×Z} p(x ∈ A|s, z) p(y_n ∈ B|s) p_n(s, z|c_r) ds dz = ∫_{S×Z} p(x ∈ A|s, z) (p(y ∈ B|s) − p(y_n ∈ B|s)) p(s, z|c_r) ds dz + ∫_{S×Z} p(x ∈ A|s, z) p(y_n ∈ B|s) (p(s, z|c_r) − p_n(s, z|c_r)) ds dz ≤ ∫_{M_s×M_z} p(x ∈ A|s, z) (p(y ∈ B|s) − p(y_n ∈ B|s)) p(s, z|c_r) ds dz (=: I_{n,1}) + ∫_{(S×Z)∖(M_s×M_z)} p(x ∈ A|s, z) (p(y ∈ B|s) − p(y_n ∈ B|s)) p(s, z|c_r) ds dz (=: I_{n,2}) + ∫_{M_s×M_z} p(x ∈ A|s, z) p(y_n ∈ B|s) (p(s, z|c_r) − p_n(s, z|c_r)) ds dz (=: I_{n,3})

we have that |p(y ∈ B|s_1) − p(y_n ∈ B|s_1)| ≤ ε from y_n →_d y. Besides, there exists an open set O_{s_1} such that for all s ∈ O_{s_1}, |p(y ∈ B|s) − p(y ∈ B|s_1)| ≤ ε and |p(y_n ∈ B|s) − p(y_n ∈ B|s_1)| ≤ ε. Again, according to the Heine-Borel theorem, there exist finitely many points s_1, ..., s_l such that M_s ⊂ ∪_{i=1}^l O(s_i). Then there exists N := max{N_{s_1}, ..., N_{s_l}} such that for all n ≥ N, |p(y ∈ B|s) − p(y_n ∈ B|s)| ≤ 3ε, ∀s ∈ M_s. (67) Therefore, I_{n,1} ≤ ∫_{M_s×M_z} 3ε p(x ∈ A|s, z) p(s, z|c) ds dz ≤ 3ε. Hence, I_{n,1} → 0 as n → ∞.

7.7 COMPARISON WITH EXISTING WORKS

7.7.1 Y → S OR S → Y?

Many existing works Rojas-Carulla et al. (2018); Khemakhem, Monti,

7.7.5 COMPARISONS WITH INVARIANT RISK MINIMIZATION ARJOVSKY ET AL. (2019) AND REFERENCES THEREIN

proposed to enforce similar behavior of the m classifiers, with the variance among them proposed as the regularization function. The work in Xie et al. (2020) proposed a Quasi-distribution framework that can incorporate empirical risk minimization, robust optimization and

x, y ∼ ∫ p_{f_x}(x|s, z) p_{f_y}(y|s) p^e(s, z|d^e_s) ds dz. The d^e_s = N(0, I_{q_{d_s}×q_{d_s}}) + 5 * e * 2; s, z|d^e_s ∼ N(μ_{s,z}(d^e_s), σ²_{s,z}(d^e_s)) with μ_{s,z} = A^μ_{s,z} * d^e_s and log σ_{s,z} = A^σ_{s,z} * d^e_s (A^μ_{s,z}, A^σ_{s,z} are random matrices); x|s, z ∼ N(μ_x(s, z), σ²_x(s, z)) with μ_x = h(A^{μ,3}_x * h(A^{μ,2}_x * h(A^{μ,1}_x * [s^T, z^T]^T))) and log σ_x = h(A^{σ,3}_x * h(A^{σ,2}_x * h(A^{σ,1}_x * [s^T, z^T]^T))) (h is the LeakyReLU activation function with slope = 0.5 and A^{μ,i}_x, A^{σ,i}_x for i = 1, 2, 3 are random matrices); y|s is generated similarly to x|s, z with A^{μ,i=1,2

Figure 4: Visualization of S. From left to right are: estimated posterior by pool-LaCIM: p pool-LaCIM (s|x, y), by LaCIM with c as input: p LaCIM (s|x, y, d s ), by LaCIM with D as input: p LaCIM (s|x, y, d); the ground-truth p φ (s|d s ).

LaCIM-ds (Ours, m = 5) 0.73 0.85 0.70 0.89 0.85 0.91 0.81 0.84 0.83 0.93 0.78 ↑ 0.89 ↑
LaCIM-ds (Ours, m = 7) 0.92 0.90 0.83 0.90 0.84 0.93 0.85 0.94 0.83 0.90 0.86 ↑ 0.91 ↑

7.9 IMPLEMENTATION DETAILS FOR OPTIMIZATION OVER S, Z

Recall that we first optimize s*, z* according to s*, z* = arg max_{s,z} log p_φ(x|s, z).

Figure 6: Handwriting Sample Form. The writer prints the digit/character (i.e., X) with the label (i.e., Y) provided first.

7.13 DISEASE PREDICTION OF ALZHEIMER'S DISEASE

Dataset Description. The dataset contains in total 317 samples: 48 AD, 75 NC, and 194 MCI.

Figure 7: Visualization on the NICO via gradient-based method Simonyan et al. (2013) for CE X → Y , CE (X, d s ) → Y and LaCIM. The selected images are (a) cat on grass, (b) cat on snow, (c) dog on grass and (d) dog on snow.

MCC of identified latent variables. Averaged over 20 runs for each dataset.

Accuracy (%) of OOD prediction. Average over three runs.
LaCIM-ds (Ours) 62.00 ± 1.73 68.00 ± 2.64 98.81 ± 0.14 65.08 ± 1.59 66.14 ± 0.91
LaCIM-d (Ours) 62.67 ± 0.58 68.67 ± 2.64 98.78 ± 0.20 64.44 ± 0.96 68.23 ± 0.90

Accuracy (%) of robustness on FaceForensics++. Average over three runs.

7.1 O.O.D GENERALIZATION ERROR BOUND

Denote E_p[y|x] := ∫_Y y p(y|x) dy for any x, y ∈ X × Y. We have E_{p^e}[y|s] = ∫_Y y p(y|s) dy; since p(y|s) is invariant across E, we can omit the p^e in E_{p^e}[y|s] and denote g(S) := E[Y|S]. Then the OOD gap |E_{p^{e_1}}[y|x] − E_{p^{e_2}}[y|x]|, ∀(x, y), is bounded as follows: Theorem 7.1 (OOD generalization error). Consider two LaCIMs P^{e_1} and P^{e_2}, and suppose that their densities p^{e_1}(s|x) and p^{e_2}(s|x) are absolutely continuous with support (−∞, ∞).

Then we have p_{f_x, f_y, T^s, Γ^s, T^z, Γ^z}(x|d^e) = p_{f̃_x, f̃_y, T̃^s, Γ̃^s, T̃^z, Γ̃^z}(x|d^e) (12) ⟹ ∫_{S×Z} p_{f_x}(x|s, z) p_{T^s, Γ^s, T^z, Γ^z}(s, z|d^e) ds dz =

∫ p_{T̃^s, Γ̃^s, T̃^z, Γ̃^z, f̃_x}(x̄|d^e) p_{ε_x}(x − x̄) dx̄ (16) ⟹ (p_{T^s, Γ^s, T^z, Γ^z, f_x} ∗ p_{ε_x})(x|d^e) = (p_{T̃^s, Γ̃^s, T̃^z, Γ̃^z, f̃_x} ∗ p_{ε_x})(x|d^e) (17)

p_{f̃_y, T̃^s, Γ̃^s}(y|d^e)

MCC of identified latent variables for p^e(x, y) = ∫ p(x|s, z) p(y|s) p(s, z|c) p(c|d^e) dc ds dz. Averaged over 20 runs for each dataset.

Accuracy (%) of Colored MNIST in the IRM setting of Arjovsky et al. (2019). Average over three runs.

Training and test environments (characterized by d_s): cat% on grass, dog% on grass, cat% on snow, dog% on snow.

Comparison on the constructed interventional dataset in terms of ACC. The training environments, characterized by c, can be found in Table 7.12. For visualization, we use the gradient-based method Simonyan et al. (2013) to visualize the neuron (in the fully connected layer for both CE x → y and CE (x, d_s) → y; in the s layer for LaCIM-d_s) that is most correlated with the label y. The d_s for the m environments. We summarize the d_s of the m = 8 and m = 14 environments in Table 7.12. As shown, the values of d_s in the test domain are extrapolations of the training environments, i.e., the d_test

Comparison on constructed interventional dataset in terms of ACC.

Training and test environments (characterized by c) in early prediction of AD

The d_s variable in training and test. We implement OOD tasks in which the value of d_s differs between training and test. Specifically, we repeatedly split the dataset into training and test sets according to a selected attribute in d_s, three times. The average values of these attributes in train and test are recorded in Table 7.13.1.

Experimental Results. We conduct OOD experiments in which the source variables Age, Gender, amyloid β and TAU differ between the training and test data. The results are shown in Table 12.

7.14 SUPPLEMENTARY FOR DEEPFAKE

Implementation Details. We apply data augmentation, specifically rotation by a 30-degree angle and horizontal flipping with 50% probability. We additionally apply random compression techniques, such as JpegCompression. For the inference model, we adopt EfficientNet-B5 Tan and Le (2019), with the detailed network structure FC(2048, 2048) → BN → ReLU → FC(2048, 2048) → BN → ReLU → FC(2048, q_{t=s,z}). The structure of the reparameterization ρ_{t=s,z} is FC(q_{t=s,z}, 2048)

General framework table for our method and baselines on the Data ∈ {CMNIST, NICO, ADNI, DeepFake} datasets. We denote the dimension of z or z_s as dim_{z,z_s}. We list the output dimension (e.g., the channel number) of each module if it differs from the one in Tab. 14.


REx. It can be concluded that robust optimization only generalizes to the convex hull of the training environments (defined as interpolation), while REx can generalize to extrapolated combinations of training environments. This work lacks a model of the underlying causal structure, although it performs similarly to IRM experimentally. Besides, Teney et al. (2020) proposed to unpool the training data into several domains with different environments and leverage Arjovsky et al. (2019) to learn invariant information for the classifier. Recently, Bellot and van der Schaar (2020) also assumed the invariance to lie in the generating mechanisms, which can generalize the capability of IRM when unobserved confounders exist. However, this work also lacks an identifiability analysis. We finish this section with the following summary of the methods in section 7.7.4 and IRM, in terms of causal factor, invariance type, direction of causal relation, theoretical justification and the ability to generalize to intervened data.

Data Generation

We set m = 5 and n_e = 1000 for each e. The generating process is introduced in supplement 7.8. We set q_{d_s} = q_s = q_z = q_y = 2 and q_x = 4. For each environment e ∈ [m] with m = 5, we generate 1000 samples. The decoders p_φ(x|s, z) and p_φ(y|s) are parameterized by deconvolutional neural networks. For all methods, we train for 200 epochs using SGD with weight decay 2 × 10^-4 and learning rate 0.01, multiplied by 0.2 after every 60 epochs. The batch size is set to 4. For each variable in the biomarker vector C ∈ R^9, each person may have multiple records, and we take its median as representative to avoid extreme values due to device abnormality. As for LaCIM-d, we adopt the same decoder p_φ(x|z, s) and classifier p_φ(y|s). For q_ψ(s, z|x, d), we adopt the same network for the shared part; for the part specific to each domain, μ_{s,z}(x, d) and log σ_{s,z}(x, d) are generated by a sub-network composed of 1024 FC-BNR → 1024 FC-BNR → q_{z,s} FC-BNR. The z, s reparameterized from μ_{s,z}(x, d) and log σ_{s,z}(x, d) are fed into a sub-network composed of q_{z,s} FC-BNR → 1024 FC-BNR → q_{z,s} FC-BNR to remove the constraint of a Gaussian distribution. The reconstructed images and predicted labels are then computed by p_φ(x|z, s) and p_φ(y|s), which have the same network structure as in LaCIM-C, from the z, s.

CE X → Y           61.9 ± 0.0  66.7 ± 1.6  63.0 ± 0.9  67.7 ± 0.9  66.1 ± 3.3  66.1 ± 1.8
DANN               62.4 ± 0.9  62.4 ± 0.9  63.0 ± 1.8  64.6 ± 0.9  67.2 ± 0.9  66.1 ± 0.9
CE (X, ds) → Y     67.2 ± 1.8  66.7 ± 3.2  63.0 ± 1.8  66.1 ± 3.3  66.1 ± 1.8  64.0 ± 0.9
sVAE               67.2 ± 0.9  67.2 ± 0.9  67.2 ± 0.9  65.6 ± 1.8  66.7 ± 2.7  65.1 ± 1.6
LaCIM-ds (Ours)    69.8 ± 1.6  68.8 ± 0.9  69.8 ± 1.6  69.3 ± 1.8  67.7 ± 0.9  67.7 ± 0.0

OOD source: Age and Gender. ACC (%), Setting#1 / Setting#2 / Setting#3 for each source.

Method             Age: #1     #2          #3          Gender: #1  #2          #3
CE X → Y           63.6 ± 2.6  65.6 ± 6.0  64.8 ± 4.7  60.5 ± 0.9  60.5 ± 1.8  60.5 ± 0.9
DANN               60.8 ± 1.8  58.7 ± 0.0  58.7 ± 0.0  58.5 ± 1.5  61.5 ± 0.0  60.0 ± 1.5
CE (X, ds) → Y     60.4 ± 2.9  64.5 ± 2.4  64.4 ± 3.8  63.2 ± 0.9  65.6 ± 1.8  64.1 ± 0.9
sVAE               58.2 ± 0.9  60.0 ± 1.8  58.7 ± 1.6  64.1 ± 0.9  65.6 ± 1.8  64.1 ± 0.9
LaCIM-ds (Ours)    64.0 ± 2.4  70.4 ± 2.4  66.1 ± 3.7  65.6 ± 0.9  67.2 ± 1.8  68.2 ± 0.9

OOD source: amyloid β and TAU. ACC (%), Setting#1 / Setting#2 / Setting#3 for each source.

Method             amyloid β: #1  #2          #3          TAU: #1     #2          #3
CE X → Y           59.2 ± 0.9  63.5 ± 4.2  63.1 ± 5.1  64.6 ± 0.9  64.1 ± 0.0  66.0 ± 1.1
DANN               60.8 ± 0.9  60.8 ± 0.9  60.8 ± 0.9  64.6 ± 0.9  65.1 ± 0.9  64.6 ± 0.9
CE (X, ds) → Y     64.6 ± 1.8  64.6 ± 3.7  64.2 ± 2.4  64.6 ± 0.9  66.7 ± 0.9  67.0 ± 1.3
sVAE               66.1 ± 0.9  64.6 ± 0.9  63.5 ± 3.2  68.2 ± 0.9  68.8 ± 2.7  67.2 ± 1.6
LaCIM-ds (Ours)    68.3 ± 1.6  66.1 ± 1.8  65.6 ± 2.4  69.8 ± 0.9  71.4 ± 1.8  68.8 ± 0.0

followed by cropping the image to the same size 3 × 224 × 224. We set q_{t=s,z} to 1024. We use SGD as the optimizer, with learning rate 0.02 and weight decay 0.00005, and run for 9 epochs.

