INVARIANT CAUSAL REPRESENTATION LEARNING

Abstract

Due to spurious correlations, machine learning systems often fail to generalize to environments whose distributions differ from the ones used at training time. Prior work addressing this problem, either explicitly or implicitly, has attempted to find a data representation that has an invariant causal relationship with the outcome. This is done by leveraging a diverse set of training environments to reduce the effect of spurious features, on top of which an invariant classifier is then built. However, these methods have generalization guarantees only when both the data representation and the classifier come from a linear model class. As an alternative, we propose Invariant Causal Representation Learning (ICRL), a learning paradigm that enables out-of-distribution generalization in the nonlinear setting (i.e., nonlinear representations and nonlinear classifiers). It builds upon a practical and general assumption: data representations factorize when conditioning on the outcome and the environment. Based on this, we show identifiability up to a permutation and pointwise transformation. We also prove that all direct causes of the outcome can be fully discovered, which further enables us to obtain generalization guarantees in the nonlinear setting. Extensive experiments on both synthetic and real-world datasets show that our approach significantly outperforms a variety of baseline methods.

1. INTRODUCTION

In recent years, despite various impressive success stories, machine learning algorithms still exhibit a significant lack of robustness. Specifically, machine learning systems often fail to generalize outside of a specific training distribution, because they tend to learn easier-to-fit spurious correlations which are prone to change between training and testing environments. We illustrate this point with the widely used example of classifying images of camels and cows (Beery et al., 2018). The training dataset has a selection bias: most pictures of cows are taken in green pastures, while most pictures of camels happen to be in deserts. A convolutional network trained on this dataset picks up the spurious correlation, i.e., it associates green pastures with cows and deserts with camels, and consequently fails to classify images of cows taken on sandy beaches. To address this problem, a natural idea is to identify which features of the training data exhibit domain-varying spurious correlations with the labels, and which features describe true correlations of interest that are stable across domains. In the example above, the former are the features describing the context (e.g., pastures and deserts), whilst the latter are the features describing the animals (e.g., animal shape). Arjovsky et al. (2019) suggest that one can identify the stable features and build invariant predictors on them by exploiting the varying degrees of spurious correlation naturally present in training data collected from multiple environments. The authors proposed the invariant risk minimization (IRM) approach to find data representations for which the optimal classifier is invariant across all environments.
Since this formulation is a challenging bi-level optimization problem, the authors proved the generalization of IRM across all environments by constraining both data representations and classifiers to be linear (Theorem 9 in Arjovsky et al. (2019)). Ahuja et al. (2020) studied the problem from the perspective of game theory, with an approach that we call IRMG for short. They showed that the set of Nash equilibria of a proposed game is equivalent to the set of invariant predictors for any finite number of environments, even with nonlinear data representations and nonlinear classifiers. However, these theoretical results in the nonlinear setting only guarantee that one can learn invariant predictors from the training environments; they do not guarantee that the learned invariant predictors generalize well across all environments, including unseen testing environments. In fact, the authors directly borrowed the linear generalization result from Arjovsky et al. (2019) and presented it as Theorem 2 in Ahuja et al. (2020). In this work we propose an alternative learning paradigm, called Invariant Causal Representation Learning (ICRL), which enables out-of-distribution (OOD) generalization in the nonlinear setting (i.e., nonlinear representations and nonlinear classifiers). We first introduce a practical and general assumption: the data representation factorizes (i.e., its components are independent of each other) when conditioning on the outcome (e.g., labels) and the environment (represented as an index). This assumption builds a bridge between supervised learning and unsupervised learning, leading to a guarantee that the data representation can be identified up to a permutation and pointwise transformation. We then theoretically show that all the direct causes of the outcome can be fully discovered.
Based on this, the challenging bi-level optimization problem in IRM and IRMG can be reduced to two simpler independent optimization problems; that is, learning the data representation and learning the optimal classifier can be performed separately. This further enables us to attain generalization guarantees in the nonlinear setting. Contributions. We propose Invariant Causal Representation Learning (ICRL), a novel learning paradigm that enables OOD generalization in the nonlinear setting. (i) We introduce a conditional factorization assumption on the data representation for OOD generalization (Assumption 1). (ii) Based on this assumption, we show that each component of the representation can be identified up to a permutation and pointwise transformation (Theorems 1, 2 & 3). (iii) We further prove that all the direct causes of the outcome can be fully discovered (Proposition 1). (iv) We show that our approach has generalization guarantees in the nonlinear setting (Proposition 2). (v) Empirical results demonstrate that our approach significantly outperforms IRM and IRMG in nonlinear scenarios.

2.1. IDENTIFIABLE VARIATIONAL AUTOENCODERS

A general issue with variational autoencoders (VAEs) (Kingma & Welling, 2013; Rezende et al., 2014) is the lack of identifiability guarantees for the deep latent variable model. In other words, it is generally impossible to approximate the true joint distribution over observed and latent variables, including the true prior and posterior distributions over the latent variables. Consider a simple latent variable model where $O \in \mathbb{R}^d$ stands for an observed variable (random vector) and $X \in \mathbb{R}^n$ for a latent variable. Khemakhem et al. (2020) showed that any model with an unconditional latent distribution $p_\theta(X)$ is unidentifiable: we can always find transformations of $X$ that change its value but do not change its distribution. Hence, the primary assumption they make to obtain an identifiability result is a conditionally factorized prior distribution over the latent variables, $p_\theta(X \mid U)$, where $U \in \mathbb{R}^m$ is an additionally observed variable (Hyvarinen et al., 2019). More specifically, let $\theta = (f, T, \lambda) \in \Theta$ be the parameters of the conditional generative model $p_\theta(O, X \mid U) = p_f(O \mid X)\, p_{T,\lambda}(X \mid U)$, where $p_f(O \mid X) = p_\epsilon(O - f(X))$, in which $\epsilon$ is an independent noise variable with probability density function $p_\epsilon(\epsilon)$, and the prior probability density function is specifically given by $$p_{T,\lambda}(X \mid U) = \prod_i \frac{Q_i(X_i)}{Z_i(U)} \exp\Big(\sum_{j=1}^{k} T_{i,j}(X_i)\,\lambda_{i,j}(U)\Big),$$ where $Q_i$ is the base measure, $Z_i(U)$ the normalizing constant, $T_i = (T_{i,1}, \ldots, T_{i,k})$ the sufficient statistics, $\lambda_i(U) = (\lambda_{i,1}(U), \ldots, \lambda_{i,k}(U))$ the corresponding parameters depending on $U$, and $k$ the dimension of each sufficient statistic, fixed in advance. It is worth noting that this assumption is not very restrictive, as exponential families have universal approximation capabilities (Sriperumbudur et al., 2017).
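To make the conditionally factorized prior concrete, the following sketch instantiates it for a univariate Gaussian, whose density is an exponential family with sufficient statistics $T(x) = (x, x^2)$ (i.e., $k = 2$). This is an illustrative example, not a reference implementation; the function names are ours.

```python
import numpy as np

def gaussian_natural_params(mu, sigma):
    """Natural parameters lambda = (mu/sigma^2, -1/(2 sigma^2))
    matching the sufficient statistics T(x) = (x, x^2)."""
    return np.array([mu / sigma**2, -1.0 / (2.0 * sigma**2)])

def expfam_logpdf(x, mu, sigma):
    """log p(x) = T(x) . lambda - log Z, with base measure Q(x) = 1."""
    lam = gaussian_natural_params(mu, sigma)
    T = np.array([x, x**2])
    # log normalizer of the Gaussian in natural parameterization
    log_Z = mu**2 / (2.0 * sigma**2) + np.log(sigma * np.sqrt(2.0 * np.pi))
    return float(T @ lam - log_Z)
```

Expanding $-\tfrac{(x-\mu)^2}{2\sigma^2}$ shows the two forms agree term by term, which is exactly why the Gaussian fits the $Q_i/Z_i \cdot \exp(T \cdot \lambda)$ template above.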
As in VAEs, we maximize the corresponding evidence lower bound: $$\mathcal{L}_{\text{iVAE}}(\theta, \phi) := \mathbb{E}_{p_D}\big[\mathbb{E}_{q_\phi(X \mid O, U)}[\log p_\theta(O, X \mid U) - \log q_\phi(X \mid O, U)]\big],$$ where $p_D$ denotes the empirical data distribution given by the dataset $D = \{(O^{(i)}, U^{(i)})\}_{i=1}^N$. This approach is called identifiable VAE (iVAE). Most importantly, it can be proved that iVAE identifies the latent variables $X$ up to a permutation and pointwise transformation under the conditions stated in Theorem 2 of Khemakhem et al. (2020).

2.2. INVARIANT RISK MINIMIZATION

Arjovsky et al. (2019) introduced invariant risk minimization (IRM), whose goal is to construct an invariant predictor $f$ that performs well across all environments $\mathcal{E}_{all}$ by exploiting the varying degrees of spurious correlation naturally present in the training data collected from multiple environments $\mathcal{E}_{tr}$, where $\mathcal{E}_{tr} \subseteq \mathcal{E}_{all}$. Technically, they consider datasets $D^e := \{(o^e_i, y^e_i)\}_{i=1}^{n_e}$ from multiple training environments $e \in \mathcal{E}_{tr}$, where $o^e_i \in \mathcal{O} \subseteq \mathbb{R}^d$ is the input observation and $y^e_i \in \mathcal{Y} \subseteq \mathbb{R}^s$ its corresponding label. The dataset $D^e$, collected from environment $e$, consists of examples distributed identically and independently according to some probability distribution $P(O^e, Y^e)$. The goal of IRM is to use these multiple datasets to learn a predictor $Y = f(O)$ that achieves the minimum risk in all environments, where the risk reached by $f$ in environment $e$ is defined as $R^e(f) = \mathbb{E}_{O^e, Y^e}[\ell(f(O^e), Y^e)]$ for a loss function $\ell$. The invariant predictor can then be formally defined as follows.

Definition 1 (Invariant Predictor, Arjovsky et al. (2019)). We say that a data representation $\Phi \in \mathcal{H}_\Phi : \mathcal{O} \to \mathcal{C}$ elicits an invariant predictor $w \circ \Phi$ across environments $\mathcal{E}$ if there is a classifier $w \in \mathcal{H}_w : \mathcal{C} \to \mathcal{Y}$ simultaneously optimal for all environments, that is, $w \in \arg\min_{\bar{w} \in \mathcal{H}_w} R^e(\bar{w} \circ \Phi)$ for all $e \in \mathcal{E}$.

Mathematically, IRM can be phrased as the following constrained bi-level optimization problem: $$\min_{\Phi \in \mathcal{H}_\Phi,\, w \in \mathcal{H}_w} \sum_{e \in \mathcal{E}_{tr}} R^e(w \circ \Phi) \quad \text{subject to} \quad w \in \arg\min_{\bar{w} \in \mathcal{H}_w} R^e(\bar{w} \circ \Phi) \;\; \text{for all } e \in \mathcal{E}_{tr}.$$
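Since the constraint above is hard to optimize directly, Arjovsky et al. (2019) also proposed a practical relaxation (IRMv1) that replaces it with a squared-gradient penalty around a fixed "dummy" classifier $w = 1$. Below is a minimal numpy sketch for a linear representation and squared loss; the function names are ours, and the setup is illustrative rather than the paper's implementation.

```python
import numpy as np

def grad_at_w1(z, y):
    """d/dw E[(w*z - y)^2] evaluated at the dummy classifier w = 1."""
    return 2.0 * np.mean((z - y) * z)

def irmv1_objective(phi, envs, lam=100.0):
    """Sum over environments of risk + lam * (gradient penalty)^2.
    envs is a list of (X, y) pairs; phi is a linear representation."""
    total = 0.0
    for X, y in envs:
        z = X @ phi
        total += np.mean((z - y) ** 2) + lam * grad_at_w1(z, y) ** 2
    return total
```

On data from the motivating example of Section 3.1, a representation that keeps only the causal feature makes the penalty vanish in every environment, while one built on the spurious feature incurs a large, environment-dependent penalty.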

3.1. A MOTIVATING EXAMPLE

In this section, we extend the example introduced by Wright (1921) and discussed by Arjovsky et al. (2019), and provide a further in-depth analysis.

Model 1. Consider the following structural equation model (SEM): $$X_1 \leftarrow \text{Gaussian}(0, \sigma_1(e)), \quad Y \leftarrow X_1 + \text{Gaussian}(0, \sigma_2(e)), \quad X_2 \leftarrow Y + \text{Gaussian}(0, \sigma_3(e)),$$ where the noise variances $\sigma_i(e) \geq 0$ vary with the environment $e \in \mathcal{E}_{all}$, and $\mathcal{E}_{all}$ is the set of all environments. To ease exposition, here we consider the simple scenario in which $\mathcal{E}_{all}$ only contains modifications varying the noises of $X_1$, $X_2$ and $Y$ within a finite range, i.e., $\sigma_i(e) \in [0, \sigma^2_{max}]$. Then, to predict $Y$ from $(X_1, X_2)$ using a least-squares predictor $\hat{Y}^e = \hat{\alpha}_1 X^e_1 + \hat{\alpha}_2 X^e_2$ for environment $e$, we can
• Case 1: regress from $X^e_1$, to obtain $\hat{\alpha}_1 = 1$ and $\hat{\alpha}_2 = 0$,
• Case 2: regress from $X^e_2$, to obtain $\hat{\alpha}_1 = 0$ and $\hat{\alpha}_2 = \frac{\sigma_1(e) + \sigma_2(e)}{\sigma_1(e) + \sigma_2(e) + \sigma_3(e)}$,
• Case 3: regress from $(X^e_1, X^e_2)$, to obtain $\hat{\alpha}_1 = \frac{\sigma_3(e)}{\sigma_2(e) + \sigma_3(e)}$ and $\hat{\alpha}_2 = \frac{\sigma_2(e)}{\sigma_2(e) + \sigma_3(e)}$.
In general scenarios (i.e., $\sigma_1(e) \neq 0$, $\sigma_2(e) \neq 0$, and $\sigma_3(e) \neq 0$), the regression using $X_1$ in Case 1 is an invariant correlation: it is the only regression whose coefficients do not vary with the environment $e$. By contrast, the regressions in both Case 2 and Case 3 have coefficients that depend on the environment $e$. Not surprisingly, only the invariant correlation in Case 1 generalizes well to new test environments.

[Figure 1: (a) the causal graph of Model 1; (b) the same model where only a transformation $O$ of $(X_1, X_2)$ is observed; (c) the general multi-dimensional setting of Section 3.2.]

From a practical perspective, let us take a closer look at Case 3. Because we do not know in advance that regressing from $X_1$ alone will lead to an invariant predictor, in practice we may regress from all the accessible data $(X^e_1, X^e_2)$. As shown above, when $\sigma_i(e) \neq 0$ for $i = 1, 2, 3$, this regression does not yield an invariant predictor. In fact, any empirical risk minimization (ERM) algorithm purely minimizing training error (Vapnik, 1992) would fail in this setting. Invariant Causal Prediction (ICP) methods (Peters et al., 2015) also do not work, since the noise variance of $Y$ may change across environments.
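The three regressions above are easy to verify numerically. Below is a short numpy sketch of Model 1, treating each $\sigma_i(e)$ as a noise variance; the function names are ours.

```python
import numpy as np

def sample_env(s1, s2, s3, n=200_000, seed=0):
    """Draw n samples from Model 1 with noise variances (s1, s2, s3)."""
    rng = np.random.default_rng(seed)
    x1 = rng.normal(0.0, np.sqrt(s1), n)
    y = x1 + rng.normal(0.0, np.sqrt(s2), n)
    x2 = y + rng.normal(0.0, np.sqrt(s3), n)
    return x1, x2, y

def ols(X, y):
    """Least-squares coefficients (no intercept; all variables are centred)."""
    return np.linalg.lstsq(X, y, rcond=None)[0]
```

With two environments differing only in $\sigma_3$, the Case 1 coefficient stays at 1 while the Case 3 coefficients shift with $\sigma_3(e)/(\sigma_2(e) + \sigma_3(e))$, matching the analysis above.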
To this end, Arjovsky et al. (2019) proposed IRM. As discussed above, however, IRM and IRMG can generalize well to unseen testing environments only in the linear setting. This motivates us to develop an approach that enables OOD generalization in the nonlinear setting (i.e., both $\Phi$ and $w$ come from a class of nonlinear models). A more straightforward way to understand the motivating example is through its corresponding graphical representation, as shown in Fig. 1a. Following Peters et al. (2015), we treat the environment as a random variable $E$, where $E$ could be any information specific to the environment. For simplicity, we let $E$ be the environment index, i.e., $E \in \{1, \ldots, N\}$, where $N$ is the number of training environments. Note that we consider $E$ a surrogate variable because it is not itself a causal variable (Zhang et al., 2017; Huang et al., 2020). From Fig. 1a, it is obvious that ICP does not work in this setting, since $Y \not\perp\!\!\!\perp E \mid X_1$. In fact, a more practical version appearing in real problems is shown in Fig. 1b, where the true variables $\{X_1, X_2\}$ are unobserved and we can only observe their transformation $O$, which is a function of $\{X_1, X_2\}$. In this case, even if $Y$ is not affected by $E$ (i.e., the edge $E \to Y$ is removed), applying ICP to $O$ still does not work, since each variable (i.e., each dimension) of $O$ is jointly influenced by both $X_1$ and $X_2$. By contrast, both IRM and IRMG work when the transformation is linear, but not when it is nonlinear. These analyses are also empirically demonstrated in Section 5.1.

3.2. THE GENERAL SETTING

Inspired by the two-dimensional example (i.e., $X = (X_1, X_2) \in \mathbb{R}^2$) described above, we naturally extend it to a more general multi-dimensional setting, as shown in Fig. 1c. Technically, we have $O \in \mathcal{O} \subseteq \mathbb{R}^d$, $Y \in \mathcal{Y} \subseteq \mathbb{R}^s$, and $X = (X_{p_1}, \ldots, X_{p_r}, X_{c_1}, \ldots, X_{c_k}) \in \mathcal{X} \subseteq \mathbb{R}^{n(r+k)}$ (lower-dimensional, $n(r+k) \leq d$), where we assume each $X_i \in \mathbb{R}^n$ for simplicity. It is worth emphasising that, apart from $X_{p_r}$, none of $\{X_{p_1}, \ldots, X_{p_{r-1}}\}$ has an arrow pointing to $Y$, meaning that all the parents of $Y$ are absorbed into $X_{p_r}$. Under this circumstance, more formally, we assume that

Assumption 1. $X_i \perp\!\!\!\perp X_j \mid Y, E$ for any $i \neq j$.

This assumption is not very restrictive, and is practically and theoretically reasonable enough to cover various scenarios, for the following reasons. Firstly, Assumption 1 allows us to deal with each $X_i$ separately. Taking a closer look at each $X_i$ in Fig. 1c, it is evident that there exist only five possible edges between $X_i$, $Y$, $E$, and $O$, as shown in Fig. 2a. Among them, only the arrow from $X_i$ to $O$ must exist, whilst the other four need not, which leads to the 12 possible types of structures shown in Figs. 2b-2m. These structures cover most scenarios in real-world applications, e.g., $X_i$ could be either a parent or a child of $Y$, be either affected or not by $E$, and even have no connection to $Y$, etc. It is also worth noting that Assumption 1 does not rule out the possibility that certain correlations exist between all these latent variables.

[Figure 2: (a) the five possible edges among $X_i$, $Y$, $E$, and $O$; (b)-(m) the 12 possible structures for each $X_i$.]

Secondly, although it is well known that the key idea behind learning disentangled representations is that real-world data is generated by a few explanatory factors of variation, Locatello et al.
(2019) show that unsupervised learning of disentangled representations is fundamentally impossible without inductive biases, and that well-disentangled models seemingly cannot be identified without supervision. They further suggest that disentanglement learning should be explicit about the role of inductive biases and supervision. In this sense, our Assumption 1 reasonably implements this idea. Thirdly, as described in Section 2.1, identifiability of the latent variables in iVAE requires a conditionally factorized prior distribution over the latent variables, as in our Assumption 1; this is a key condition under which the latent variables are guaranteed to be identifiable up to a permutation and pointwise transformation. This condition further inspires us to develop a generalization theory in the nonlinear setting, formulated in Section 4.3. Now, under Assumption 1, solving Eq. (4) reduces to the key question of how to find the subset of $X$ containing the direct causes of $Y$ from the observational data $\{O, E, Y\}$, which will be discussed in the next section.
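Assumption 1 can also be probed empirically. Below is a sketch using linear residualization as a simple (Gaussian-case) proxy for a conditional independence test; a real test could be kernel-based, and the function name is ours. In Model 1, $X_1$ and $X_2$ are marginally correlated, but become uncorrelated once $Y$ is partialled out, which is the single-environment analogue of $X_1 \perp\!\!\!\perp X_2 \mid Y, E$.

```python
import numpy as np

def partial_corr(a, b, z):
    """Correlation between a and b after linearly regressing out z."""
    Z = np.column_stack([np.ones_like(z), z])
    ra = a - Z @ np.linalg.lstsq(Z, a, rcond=None)[0]
    rb = b - Z @ np.linalg.lstsq(Z, b, rcond=None)[0]
    return np.corrcoef(ra, rb)[0, 1]
```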

4. OUR APPROACH

In this section, we formally introduce our algorithm, namely Invariant Causal Representation Learning (ICRL), which consists of three phases and is summarized in Algorithm 1. The basic idea is that we first identify true latent variables by leveraging iVAE under Assumption 1 (Phase 1), then discover direct causes of Y (Phase 2), and finally learn an invariant predictor based on the identified direct causes (Phase 3).

4.1. PHASE 1: IDENTIFYING TRUE LATENT VARIABLES USING IVAE

Under Assumption 1, it is straightforward to identify the true hidden factors $X$ from $O$ with the help of $Y$ and $E$ by leveraging iVAE. We can directly substitute $U$ with $(Y, E)$ in Eq. (1) and obtain the corresponding generative model:
$$p_\theta(O, X \mid Y, E) = p_f(O \mid X)\, p_{T,\lambda}(X \mid Y, E), \quad (5)$$
$$p_f(O \mid X) = p_\epsilon(O - f(X)). \quad (6)$$

Algorithm 1: Invariant Causal Representation Learning
Phase 1: Learn the iVAE model, including the generative model and its corresponding inference model, by optimizing the evidence lower bound in Eq. (8) on the data $\{O, Y, E\}$. Then use the learned iVAE model to infer the corresponding latent variables $X$ from $\{O, Y, E\}$, which are guaranteed to be identified up to a permutation and pointwise transformation.
Phase 2: Given $X$, according to Theorem 1, discover the direct causes $Pa(Y)$ of $Y$ by performing only Rule 1.4, Rule 1.8, Rule 2.1, and Rule 3.1 described in Appendix E.3.
Phase 3: Given $Pa(Y)$, separately optimize Eq. (9) and Eq. (10) to learn the invariant data representation $\Phi$ and the invariant classifier $w$.

Likewise, we also obtain the corresponding prior distribution and lower bound:
$$p_{T,\lambda}(X \mid Y, E) = \prod_i \frac{Q_i(X_i)}{Z_i(Y, E)} \exp\Big(\sum_{j=1}^{k} T_{i,j}(X_i)\,\lambda_{i,j}(Y, E)\Big), \quad (7)$$
$$\mathcal{L}_{\text{phase1}}(\theta, \phi) := \mathbb{E}_{p_D}\big[\mathbb{E}_{q_\phi(X \mid O, Y, E)}[\log p_\theta(O, X \mid Y, E) - \log q_\phi(X \mid O, Y, E)]\big]. \quad (8)$$
This bound can be further expanded for computational convenience, as given in Appendix C. More importantly, under this setting we can directly borrow the identifiability result from Khemakhem et al. (2020) and restate it below, replacing $U$ with $(Y, E)$.

Theorem 1. Assume that we observe data sampled from a generative model defined according to Eqs. (5-7), with parameters $\theta := (f, T, \lambda)$ and $k \geq 2$. Assume the following holds:
(i) The set $\{O \in \mathcal{O} \mid \varphi_\epsilon(O) = 0\}$ has measure zero, where $\varphi_\epsilon$ is the characteristic function of the density $p_\epsilon$ defined in Eq. (6).
(ii) The mixing function $f$ in Eq. (6) is injective and has all second-order cross derivatives.
(iii) The sufficient statistics $T_{i,j}$ in Eq. (7) are twice differentiable, and $(T_{i,j})_{1 \leq j \leq k}$ are linearly independent on any subset of $\mathcal{X}$ of measure greater than zero.
(iv) There exist $nk + 1$ distinct points $(Y, E)^0, \ldots, (Y, E)^{nk}$ such that the matrix $L = \big(\lambda((Y, E)^1) - \lambda((Y, E)^0), \ldots, \lambda((Y, E)^{nk}) - \lambda((Y, E)^0)\big)$ of size $nk \times nk$ is invertible.
Then the parameters $\theta$ are identifiable up to a permutation and pointwise transformation.

Theorem 1 deals with the general case $k \geq 2$; its proof is given in Khemakhem et al. (2020). According to Theorem 1, we further have the following consistency result.

Theorem 2. Assume the following holds: (i) The family of distributions $q_\phi(X \mid O, Y, E)$ contains $p_\theta(X \mid O, Y, E)$. (ii) We maximize $\mathcal{L}_{\text{phase1}}(\theta, \phi)$ with respect to both $\theta$ and $\phi$. Then in the limit of infinite data, iVAE learns the true parameters $\theta^*$ up to a permutation and pointwise transformation.

An immediate consequence of Theorem 1 and Theorem 2 is as follows.

Theorem 3. Assume the hypotheses of Theorem 1 and Theorem 2 hold. Then in the limit of infinite data, iVAE learns the true latent variables $X^*$ up to a permutation and pointwise transformation.

The proofs of Theorem 2 and Theorem 3 are given in Appendix E. Theorem 3 says that we can leverage iVAE to learn the true conditionally factorized latent variables up to a permutation and pointwise transformation, which achieves our goal stated in Assumption 1.

4.2. PHASE 2: DISCOVERING DIRECT CAUSES

After identifying all the conditionally factorized latent variables $X$ from $O$, the next question is how to determine which components of $X := (X_{p_1}, \ldots, X_{p_r}, X_{c_1}, \ldots, X_{c_k})$ are the direct causes of $Y$. Under Assumption 1, the possible structures can be distinguished by performing conditional independence tests (Spirtes et al., 2000; Zhang et al., 2012) and by leveraging causal discovery algorithms (Janzing et al., 2013; Peters et al., 2017; Zhang et al., 2017; Huang et al., 2020). This is summarized in Proposition 1, whose proof is given in Appendix E.

Proposition 1. All the 12 structures shown in Figs. 2b-2m can be independently and completely distinguished in parallel.

Proposition 1 allows us to efficiently discover all the direct causes of $Y$ from $X$ by independently performing conditional independence tests and causal discovery algorithms for each $X_i$ in parallel. Moreover, for each $X_i$ we only need to check whether it matches one of the four structures in Figs. 2c, 2f, 2i, and 2l, and this check can also be performed in parallel.

Eq. (9) and Eq. (10) in Phase 3 guarantee that ICRL can achieve low error across $\mathcal{E}_{tr}$, and Phases 1 & 2 enforce invariance across $\mathcal{E}_{tr}$. This brings us to the question of how to enable OOD generalization; in other words, how does ICRL achieve low error across $\mathcal{E}_{all}$? As Arjovsky et al. (2019) pointed out, low error across $\mathcal{E}_{tr}$ together with invariance across $\mathcal{E}_{all}$ leads to low error across $\mathcal{E}_{all}$, because the generalization error of $w \circ \Phi$ respects standard error bounds once a data representation $\Phi$ eliciting an invariant predictor $w \circ \Phi$ across $\mathcal{E}_{all}$ is estimated. Thus, enabling OOD generalization finally comes down to the question: under which conditions does invariance across $\mathcal{E}_{tr}$ imply invariance across $\mathcal{E}_{all}$? Not surprisingly, $\mathcal{E}_{tr}$ must contain sufficient diversity to pin down an invariance that holds across $\mathcal{E}_{all}$. Fortunately, the hypotheses of Theorem 1 automatically provide such a guarantee, and we therefore have the following result, whose proof is in Appendix E.

Proposition 2. Assume the hypotheses of Theorem 1 and Theorem 2 hold. Then in the limit of infinite data, ICRL learns an invariant predictor $w \circ \Phi$ across $\mathcal{E}_{all}$.

Proposition 2 tells us that, under the assumptions given in Theorem 1 and Theorem 2, if ICRL can learn an invariant predictor $w \circ \Phi$ across $\mathcal{E}_{tr}$ in the limit of infinite data, then such a predictor is invariant across $\mathcal{E}_{all}$.
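The per-component, parallel nature of Phase 2 can be sketched as follows. The appendix rules are not reproduced here; as a hypothetical stand-in, we flag a component as a direct-cause candidate when its per-environment regression coefficient onto $Y$ is environment-independent, in the spirit of the motivating example. All names below are ours.

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def stable_coefficient(envs, i, tol=0.05):
    """Hypothetical stand-in for the appendix rules: flag X_i as a
    direct-cause candidate if regressing Y on X_i gives an
    (approximately) environment-independent coefficient."""
    coefs = []
    for X, y in envs:
        xi = X[:, i]
        coefs.append(float(xi @ y / (xi @ xi)))
    return max(coefs) - min(coefs) < tol

def direct_cause_candidates(envs, n_components):
    """Each component is checked independently, hence in parallel."""
    with ThreadPoolExecutor() as pool:
        flags = list(pool.map(lambda i: stable_coefficient(envs, i),
                              range(n_components)))
    return [i for i, ok in enumerate(flags) if ok]
```

On data from Model 1 with two environments differing in $\sigma_3$, this flags $X_1$ (the cause) and rejects $X_2$ (the effect), since only the former has a stable coefficient.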

5. EXPERIMENTS

We compare our approach with a variety of methods on both synthetic and real-world datasets. Due to space limits, we defer to the supplement a detailed description of the datasets (Appendix F) and model architectures (Appendix H), as well as some in-depth analysis of the experimental results (Appendix G). In all comparisons, unless stated otherwise, we average the performance of the different methods over ten runs.

5.1. SYNTHETIC DATA

As a first experiment, in order to interpret how our approach works, we conduct a series of experiments on synthetic data generated according to an extension of the SEM in Model 1. This more practical extension increases the dimensionality of the two true features $X := (X_1, X_2)$ to 10 dimensions through a linear or nonlinear transformation, as illustrated in Fig. 1b. Technically, the goal is to predict $Y$ from $O$, where $O = g(X)$ and $g(\cdot)$ is called the X Transformer. We consider three types of transformations: (a) Identity: $g(\cdot)$ is the identity matrix $I \in \mathbb{R}^{2 \times 2}$, i.e., $O = g(X) = X$. (b) Linear: $g(\cdot)$ is a random matrix $S \in \mathbb{R}^{2 \times 10}$, i.e., $O = g(X) = X \cdot S$. (c) Nonlinear: $g(\cdot)$ is implemented by a multilayer perceptron with a 2-dimensional input and a 10-dimensional output, whose parameters are randomly initialized in advance. For simplicity, here we consider the regression task, in which the mean squared error (MSE) is used as the metric. We consider an extremely simple scenario in which we fix $\sigma_1 = 1$ and $\sigma_2 = 0$ for all environments and only allow $\sigma_3$ to vary across environments. In this case, $\sigma_3$ controls how much the representation depends on the variable $X_2$, which is responsible for the spurious correlations. Each experiment draws 1000 samples from each of the five environments $\sigma_3 \in \{0.2, 2, 5, 20, 100\}$, where the first two are used for training and the rest for testing. We compare with several closely related baselines: ERM, IRM, and two variants of IRMG: F-IRM Game (with $\Phi$ fixed to the identity) and V-IRM Game (with a variable $\Phi$). As shown in Table 1, in the Identity and Linear cases, our approach is better than IRMG but only comparable with ERM and IRM. This might be because identifiability only up to a pointwise nonlinear transformation effectively converts the original identity or linear problem into a nonlinear one, making it harder than it needs to be.
In the Nonlinear case, the advantage of our approach clearly grows as the spurious correlation becomes stronger. We also perform a series of experiments to further analyze our approach, including an analysis of the importance of Assumption 1 and of the necessity of iVAE in Phase 1, of how accurately the direct causes can be recovered in Phase 2, and of how well the two optimization problems can be solved in Phase 3; all of these can be found in Appendix G.
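The synthetic setup above can be sketched as follows. The nonlinear transformer below is an illustrative stand-in for the randomly initialized MLP, we treat $\sigma_3$ as a variance, and all names are ours.

```python
import numpy as np

def make_environment(sigma3, g, n=1000, seed=0):
    """One environment of the extended Model 1: sigma1 = 1 and
    sigma2 = 0 are fixed; only sigma3 varies across environments."""
    rng = np.random.default_rng(seed)
    x1 = rng.normal(0.0, 1.0, (n, 1))
    y = x1.copy()                                   # sigma2 = 0 => Y = X1
    x2 = y + rng.normal(0.0, np.sqrt(sigma3), (n, 1))
    return g(np.hstack([x1, x2])), y

identity = lambda X: X                              # (a) Identity
S = np.random.default_rng(1).normal(size=(2, 10))   # (b) Linear: random S
linear = lambda X: X @ S
nonlinear = lambda X: np.tanh(X @ S)                # (c) stand-in for the MLP
```

Training environments would use the small values of $\sigma_3$ (0.2 and 2) and testing the large ones, so that the spurious channel through $X_2$ degrades at test time.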

5.2. COLORED MNIST AND COLORED FASHION MNIST

In this section, we conduct experiments on two datasets used in IRM and IRMG: Colored MNIST and Colored Fashion MNIST. We follow the same setting as Ahuja et al. (2020) to create these two datasets. The task is to predict a binary label assigned to each image, which is originally grayscale but artificially colored in a way that correlates strongly but spuriously with the class label. We compare with 1) IRM; 2) two variants of IRMG: F-IRM Game (with $\Phi$ fixed to the identity) and V-IRM Game (with a variable $\Phi$); 3) three variants of ERM: ERM (on the entire training data), ERM e (on each environment $e$), and ERM GRAYSCALE (on data with no spurious correlations); and 4) ROBUST MIN MAX (minimizing the maximum loss across the multiple environments). Table 2 shows that our approach significantly outperforms all the others on Colored Fashion MNIST. It is worth emphasising that both the train and test accuracies of our method closely approach those of ERM GRAYSCALE and OPTIMAL, implying that it does approximately learn the true invariant causal representation, with nearly no correlation with the color. We reach a similar conclusion from the results on Colored MNIST, shown in Table 3. However, this dataset seems more difficult than the fashion version, because ERM GRAYSCALE still underperforms even though the spurious correlation with the color is removed, implying that the data might involve other spurious correlations. In this case, two training environments might not be enough to eliminate all the spurious correlations.

6. RELATED WORK

Invariant Causal Prediction (ICP) (Peters et al., 2015) finds causal predictors by exploiting the invariance property in causality, which has been widely discussed under the terms "autonomy", "modularity", and "stability" (Haavelmo, 1944; Aldrich, 1989; Hoover, 1990; Pearl, 2009; Dawid et al., 2010; Schölkopf et al., 2012). This invariance property as assumed in ICP and its nonlinear extension (Heinze-Deml et al., 2018) is limited, because no intervention is allowed on the target variable $Y$.
Besides, ICP methods implicitly assume that the variables of interest $X$ are given. Magliacane et al. (2018) and Subbaswamy et al. (2019) attempt to find maximally predictive invariant predictors using conditional independence tests and other graph-theoretic tools; both also assume that $X$ is given, and further assume that additional information about the structure over $X$ is known. Arjovsky et al. (2019) reformulate this invariance as an optimization-based problem, allowing one to learn from $O$ an invariant data representation that is required to be a linear transformation of $X$. Ahuja et al. (2020) extend IRM to the nonlinear setting from the perspective of game theory, but their nonlinear theory holds only in the training environments.

7. CONCLUSION

We developed a novel framework to learn invariant predictors from a diverse set of training environments. This framework is based on a practical and general assumption: the data representation can be factorized when conditioning on the outcome and the environment. The assumption leads to a guarantee that the components in the representation can be identified up to a permutation and pointwise transformation. This allows us to further discover all the direct causes of the outcome, which enables generalization guarantees in the nonlinear setting. We hope our framework would inspire new ways to address the OOD generalization problem through the causal lens.

A VARIATIONAL AUTOENCODERS

We briefly describe the identifiable variational autoencoders (iVAEs) proposed by Khemakhem et al. (2020). The framework of variational autoencoders (VAEs) (Kingma & Welling, 2013; Rezende et al., 2014) allows us to efficiently learn deep latent-variable models and their corresponding inference models. Consider a simple latent variable model where $O \in \mathbb{R}^d$ stands for an observed variable (random vector) and $X \in \mathbb{R}^n$ for a latent variable. The VAE model learns a full generative model $p_\theta(O, X) = p_\theta(O \mid X)\, p_\theta(X)$ and an inference model $q_\phi(X \mid O)$ that approximates its posterior $p_\theta(X \mid O)$, where $\theta$ is a vector of parameters of the generative model, $\phi$ a vector of parameters of the inference model, and $p_\theta(X)$ a prior distribution over the latent variables. Instead of maximizing the data log-likelihood, we maximize its lower bound $\mathcal{L}_{\text{VAE}}(\theta, \phi)$: $$\log p_\theta(O) \geq \mathcal{L}_{\text{VAE}}(\theta, \phi) := \mathbb{E}_{q_\phi(X \mid O)}[\log p_\theta(O \mid X)] - \mathrm{KL}\big(q_\phi(X \mid O)\,\|\,p_\theta(X)\big),$$ where the inequality follows from Jensen's inequality, and $\mathrm{KL}(\cdot\|\cdot)$ denotes the Kullback-Leibler divergence between two distributions.
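For the common case of a diagonal-Gaussian posterior and a standard-normal prior, the KL term in $\mathcal{L}_{\text{VAE}}$ has a closed form that can be checked against a Monte Carlo estimate. A small numerical sketch, assuming $p_\theta(X) = \mathcal{N}(0, I)$; the function names are ours.

```python
import numpy as np

def kl_closed_form(mu, sigma):
    """KL(q || p) for q = N(mu, diag(sigma^2)) and p = N(0, I)."""
    return 0.5 * np.sum(sigma**2 + mu**2 - 1.0 - np.log(sigma**2))

def kl_monte_carlo(mu, sigma, n=500_000, seed=0):
    """Estimate KL(q || p) as the sample mean of log q(x) - log p(x)."""
    rng = np.random.default_rng(seed)
    x = mu + sigma * rng.normal(size=(n, mu.size))
    log_q = np.sum(-0.5 * np.log(2 * np.pi * sigma**2)
                   - (x - mu) ** 2 / (2 * sigma**2), axis=1)
    log_p = np.sum(-0.5 * np.log(2 * np.pi) - x**2 / 2, axis=1)
    return float(np.mean(log_q - log_p))
```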

B DEFINITIONS

For convenience, we restate some definitions here; please refer to the original papers (Arjovsky et al., 2019; Peters et al., 2017) for more details.

Definition 2. A structural equation model (SEM) $\mathcal{C} := (S, N)$ governing the random vector $X = (X_1, \ldots, X_d)$ is a set of structural equations $S_i : X_i \leftarrow f_i(Pa(X_i), N_i)$, where $Pa(X_i) \subseteq \{X_1, \ldots, X_d\} \setminus \{X_i\}$ are called the parents of $X_i$, and the $N_i$ are independent noise random variables. We say that "$X_i$ causes $X_j$" if $X_i \in Pa(X_j)$. The causal graph of $X$ is obtained by drawing i) one node for each $X_i$, and ii) one edge from $X_i$ to $X_j$ if $X_i \in Pa(X_j)$. We assume acyclic causal graphs.

Definition 3. Consider an SEM $\mathcal{C} := (S, N)$. An intervention $e$ on $\mathcal{C}$ consists of replacing one or several of its structural equations to obtain an intervened SEM $\mathcal{C}^e := (S^e, N^e)$, with structural equations $S^e_i : X^e_i \leftarrow f^e_i(Pa^e(X^e_i), N^e_i)$. The variable $X_i$ is said to be intervened on if $S_i \neq S^e_i$ or $N_i \neq N^e_i$.

Definition 4. Consider an SEM $S$ governing the random vector $(X_1, \ldots, X_n, Y)$, and the learning goal of predicting $Y$ from $X$. Then, the set of all environments $\mathcal{E}_{all}(S)$ indexes all the interventional distributions $P(X^e, Y^e)$ obtainable by valid interventions $e$. An intervention $e \in \mathcal{E}_{all}(S)$ is valid as long as (i) the causal graph remains acyclic, (ii) $\mathbb{E}[Y^e \mid Pa(Y)] = \mathbb{E}[Y \mid Pa(Y)]$, and (iii) $\mathbb{V}[Y^e \mid Pa(Y)]$ remains within a finite range.
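Definitions 2-4 can be made concrete in a few lines: an SEM is a topologically ordered set of structural equations, and an intervention replaces one of them. A sketch using Model 1 (names are ours); the intervention rescales the noise of $X_2$, which is valid in the sense of Definition 4 since the mechanism of $Y$ is untouched.

```python
import numpy as np

def sample_sem(equations, n, seed=0):
    """equations: dict name -> f(values, noise). Insertion order must be
    a topological order of the (acyclic) causal graph."""
    rng = np.random.default_rng(seed)
    values = {}
    for name, f in equations.items():
        values[name] = f(values, rng.normal(size=n))
    return values

# Model 1 as an SEM with unit noise variances
base = {
    "X1": lambda v, eps: eps,
    "Y":  lambda v, eps: v["X1"] + eps,
    "X2": lambda v, eps: v["Y"] + eps,
}
# A valid intervention e: rescale X2's noise, leaving Y's mechanism intact
intervened = dict(base, X2=lambda v, eps: v["Y"] + 3.0 * eps)
```

Sampling both SEMs shows $\mathrm{Var}(X_2)$ changing from about 3 to about 11, while the conditional expectation of $Y$ given its parent $X_1$ is unchanged.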

C DERIVATION

In Phase 1, the lower bound is defined by

$$\begin{aligned}
\mathcal{L}_{\text{phase1}}(\theta, \phi)
&:= \mathbb{E}_{p_{\mathcal{D}}} \mathbb{E}_{q_\phi(X|O,Y,E)}\left[\log p_\theta(O, X | Y, E) - \log q_\phi(X|O,Y,E)\right] \\
&= \mathbb{E}_{p_{\mathcal{D}}} \mathbb{E}_{q_\phi(X|O,Y,E)}\left[\log p_f(O|X) + \log p_{T,\lambda}(X|Y,E) - \log q_\phi(X|O,Y,E)\right] \\
&= \mathbb{E}_{p_{\mathcal{D}}} \Big\{ \mathbb{E}_{q_\phi(X|O,Y,E)}\left[\log p_f(O|X)\right] + \mathbb{E}_{q_\phi(X|O,Y,E)}\left[\log p_{T,\lambda}(X|Y,E)\right] - \mathbb{E}_{q_\phi(X|O,Y,E)}\left[\log q_\phi(X|O,Y,E)\right] \Big\}.
\end{aligned}$$

The first term is the data log-likelihood, and the third term has a closed-form solution,

$$\mathbb{E}_{q_\phi(X|O,Y,E)}\left[\log q_\phi(X|O,Y,E)\right] = -\frac{J}{2}\log(2\pi) - \frac{1}{2}\sum_{j=1}^{J}\left(1 + \log \sigma_j^2\right),$$

where σ_j denotes the j-th element of the variational standard deviation σ evaluated at a datapoint, which is simply a function of (O, Y, E) and the variational parameters φ. Now let us look at the second term:

$$\begin{aligned}
\mathbb{E}_{q_\phi(X|O,Y,E)}\left[\log p_{T,\lambda}(X|Y,E)\right]
&= \mathbb{E}_{q_\phi(X|O,Y,E)}\left[\log \prod_i \frac{Q_i(X_i)}{Z_i(Y,E)} \exp\left(\sum_{j=1}^{k} T_{i,j}(X_i)\,\lambda_{i,j}(Y,E)\right)\right] \\
&= \mathbb{E}_{q_\phi(X|O,Y,E)}\left[\sum_i \log\left( \frac{Q_i(X_i)}{Z_i(Y,E)} \exp\left(\sum_{j=1}^{k} T_{i,j}(X_i)\,\lambda_{i,j}(Y,E)\right)\right)\right] \\
&\propto \mathbb{E}_{q_\phi(X|O,Y,E)}\left[\sum_i \sum_{j=1}^{k} T_{i,j}(X_i)\,\lambda_{i,j}(Y,E)\right] \\
&\approx \frac{1}{L}\sum_{l=1}^{L} \sum_i \sum_{j=1}^{k} T_{i,j}(X_i^l)\,\lambda_{i,j}(Y, E^l),
\end{aligned}$$

where we let the base measure Q_i(X_i) = 1 and L is the sample size.
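The closed-form third term above is just the negative differential entropy of a diagonal Gaussian. The following sketch (our own check, not the paper's code) verifies the closed form against a Monte Carlo estimate:

```python
import numpy as np

rng = np.random.default_rng(2)

# Closed form of E_q[log q] for a diagonal Gaussian q = N(mu, diag(sigma^2)),
# matching the third ELBO term: -J/2 * log(2*pi) - 1/2 * sum_j (1 + log sigma_j^2)
def e_log_q_closed(sigma2):
    J = len(sigma2)
    return -J/2*np.log(2*np.pi) - 0.5*np.sum(1.0 + np.log(sigma2))

def e_log_q_mc(mu, sigma2, n=400000):
    # Sample from q and average log q(x) directly
    x = mu + np.sqrt(sigma2) * rng.standard_normal((n, len(mu)))
    logq = -0.5*np.sum(np.log(2*np.pi*sigma2) + (x - mu)**2/sigma2, axis=1)
    return logq.mean()

mu = np.array([0.3, -1.0, 2.0])
sigma2 = np.array([0.5, 1.0, 2.0])
assert abs(e_log_q_closed(sigma2) - e_log_q_mc(mu, sigma2)) < 0.02
```

The mean μ cancels out, which is why the closed form depends only on the variances σ_j².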

D THEOREMS

Theorem 4. Assume that we observe data sampled from a generative model defined according to Eqs. (5)-(7), with parameters θ = (f, T, λ) and k = 1. Assume the following holds:

(i) The set {O ∈ O | ϕ(O) = 0} has measure zero, where ϕ is the characteristic function of the density p defined in Eq. (6).

(ii) The mixing function f in Eq. (6) is injective, and all partial derivatives of f are continuous.

(iii) The sufficient statistics T_{i,j} in Eq. (7) are differentiable almost everywhere and not monotonic, and (T_{i,j})_{1≤j≤k} are linearly independent on any subset of X of measure greater than zero.

(iv) There exist nk + 1 distinct points (Y, E)^0, . . . , (Y, E)^{nk} such that the matrix

$$L = \left(\lambda\big((Y,E)^1\big) - \lambda\big((Y,E)^0\big), \ \ldots, \ \lambda\big((Y,E)^{nk}\big) - \lambda\big((Y,E)^0\big)\right)$$

of size nk × nk is invertible.

Then the parameters θ = (f, T, λ) are identifiable up to a permutation and pointwise transformation.
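Assumption (iv) can be checked numerically for a given collection of (Y, E) points. A small sketch (our own, with randomly generated λ values standing in for the model's natural parameters):

```python
import numpy as np

rng = np.random.default_rng(3)

# Assumption (iv): nk+1 points (Y,E)_0..(Y,E)_nk whose lambda-differences form
# an invertible nk x nk matrix. lambdas has shape (nk+1, nk), one row per point.
def diversity_matrix(lambdas):
    # Columns of L are lambda((Y,E)_l) - lambda((Y,E)_0) for l = 1..nk
    return (lambdas[1:] - lambdas[0]).T

nk = 4
# Sufficiently diverse environments: random lambdas are almost surely invertible
L = diversity_matrix(rng.standard_normal((nk + 1, nk)))
assert np.linalg.matrix_rank(L) == nk

# Degenerate case: if all (Y,E) points share the same lambda, the matrix is singular
L_bad = diversity_matrix(np.tile(rng.standard_normal(nk), (nk + 1, 1)))
assert np.linalg.matrix_rank(L_bad) == 0
```

Intuitively, the rank condition fails exactly when the environments and labels do not move the natural parameters in enough independent directions.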

E PROOFS E.1 PROOF OF THEOREM 2

This proof is similar to that of Theorem 4 in Khemakhem et al. (2020).

Proof. The loss function in Eq. (3) can be rephrased as follows:

$$\mathcal{L}_{\text{phase1}}(\theta, \phi) = \log p_\theta(O|Y,E) - \mathrm{KL}\left(q_\phi(X|O,Y,E) \,\|\, p_\theta(X|O,Y,E)\right).$$

If the family q_φ(X|O, Y, E) is flexible enough to contain p_θ(X|O, Y, E), then by optimizing the loss over its parameter φ, we minimize the KL term, which eventually reaches zero, and the loss becomes equal to the log-likelihood. Under this circumstance, the iVAE inherits all the properties of maximum likelihood estimation (MLE). In this particular case, since our identifiability is guaranteed up to a permutation and pointwise transformation, the consistency of MLE means that we converge to the true parameter θ* up to a permutation and pointwise transformation in the limit of infinite data. Because true identifiability is one of the assumptions for MLE consistency, replacing it by identifiability up to a permutation and pointwise transformation does not change the proof but only the conclusion.

E.2 PROOF OF THEOREM 3

Proof. Theorem 1 and Theorem 2 guarantee that, in the limit of infinite data, iVAE can learn the true parameters θ* := (f*, T*, λ*) up to a permutation and pointwise transformation. Let (f̂, T̂, λ̂) be the parameters obtained by iVAE; we therefore have (f̂, T̂, λ̂) ∼_P (f*, T*, λ*), where ∼_P denotes equivalence up to a permutation and pointwise transformation. In the noiseless case, this means that the learned f̂ transforms O into X̂ = f̂^{-1}(O), which equals X* = (f*)^{-1}(O) up to a permutation and signed scaling. In the noisy case, we obtain the posteriors of the latents up to an analogous indeterminacy.

E.3 PROOF OF PROPOSITION 1

Proof. The following rules can be performed independently to distinguish all the 12 possible structures shown in Figs. 2b-2m. For clarity, we divide them into three groups.

Group 1. All eight structures in this group can be discovered by performing conditional independence tests alone.

• Rule 1.1 If X_i ⊥⊥ Y, X_i ⊥⊥ E, and E ⊥⊥ Y, then Fig. 2b is discovered.
• Rule 1.2 If X_i ⊥⊥ Y, X_i ⊥⊥ E, and E ⊥⊥ Y, then Fig. 2h is discovered.
• Rule 1.3 If X_i ⊥⊥ Y, X_i ⊥⊥ E, and E ⊥⊥ Y, then Fig. 2e is discovered.
• Rule 1.4 If X_i ⊥⊥ Y, X_i ⊥⊥ E, and E ⊥⊥ Y, then Fig. 2i is discovered.
• Rule 1.5 If X_i ⊥⊥ Y, X_i ⊥⊥ E, and E ⊥⊥ Y, then Fig. 2g is discovered.
• Rule 1.6 If X_i ⊥⊥ Y, X_i ⊥⊥ E, E ⊥⊥ Y, and X_i ⊥⊥ Y | E, then Fig. 2k is discovered.
• Rule 1.7 If X_i ⊥⊥ Y, X_i ⊥⊥ E, E ⊥⊥ Y, and X_i ⊥⊥ E | Y, then Fig. 2j is discovered.
• Rule 1.8 If X_i ⊥⊥ Y, X_i ⊥⊥ E, E ⊥⊥ Y, and Y ⊥⊥ E | X_i, then Fig. 2f is discovered.

Group 2. If X_i ⊥⊥ Y, X_i ⊥⊥ E, and E ⊥⊥ Y, then we can discover both Fig. 2c and Fig. 2d. These two structures cannot be further distinguished by conditional independence tests alone, because they belong to the same Markov equivalence class. Fortunately, we can further distinguish them by running binary causal discovery algorithms (Peters et al., 2017), e.g., ANM (Hoyer et al., 2009) or the bivariate fit model, which is based on a best-fit criterion relying on a Gaussian process regressor.

• Rule 2.1 If X_i ⊥⊥ Y, X_i ⊥⊥ E, and E ⊥⊥ Y, and a chosen binary causal discovery algorithm prefers X_i → Y to X_i ← Y, then Fig. 2c is discovered.
• Rule 2.2 If X_i ⊥⊥ Y, X_i ⊥⊥ E, and E ⊥⊥ Y, and a chosen binary causal discovery algorithm prefers X_i ← Y to X_i → Y, then Fig. 2d is discovered.

Group 3. If X_i ⊥⊥ Y, X_i ⊥⊥ E, E ⊥⊥ Y, X_i ⊥⊥ Y | E, X_i ⊥⊥ E | Y, and Y ⊥⊥ E | X_i, then we can discover both Fig. 2l and Fig. 2m.
These two structures cannot be further distinguished by conditional independence tests alone, because they belong to the same Markov equivalence class. They also cannot be distinguished by any binary causal discovery algorithm, since both X_i and Y are affected by E. Fortunately, Zhang et al. (2017) provided a heuristic solution to this issue based on the invariance of causal mechanisms, i.e., P(cause) and P(effect|cause) change independently. The detailed description of their method is given in Section 4.2 of Zhang et al. (2017); for convenience, here we directly borrow their final result. Zhang et al. (2017) state that determining the causal direction between X_i and Y in Fig. 2l and Fig. 2m finally reduces to calculating the term

$$\Delta_{X_i \to Y} = \left\langle \log \frac{\hat{P}(Y|X_i)}{\bar{P}(Y|X_i)} \right\rangle,$$

where ⟨·⟩ denotes the sample average, P̂(Y|X_i) is the empirical estimate of P(Y|X_i) on all data points, and P̄(Y|X_i) denotes the sample average of P̂^e(Y|X_i), the estimate of P(Y|X_i) in each environment. We take the direction for which ∆ is smaller to be the causal direction.

• Rule 3.1 If X_i ⊥⊥ Y, X_i ⊥⊥ E, E ⊥⊥ Y, X_i ⊥⊥ Y | E, X_i ⊥⊥ E | Y, Y ⊥⊥ E | X_i, and ∆_{X_i→Y} is smaller than ∆_{Y→X_i}, then Fig. 2l is discovered.
• Rule 3.2 If X_i ⊥⊥ Y, X_i ⊥⊥ E, E ⊥⊥ Y, X_i ⊥⊥ Y | E, X_i ⊥⊥ E | Y, Y ⊥⊥ E | X_i, and ∆_{Y→X_i} is smaller than ∆_{X_i→Y}, then Fig. 2m is discovered.

E.4 PROOF OF PROPOSITION 2

Proof. Firstly, assumption (iii) and assumption (iv) in Theorem 1 are the requirements that the set of training environments contain sufficient diversity and satisfy an underlying invariance which holds across all the environments. Interestingly, assumption (iii) elicits Lemma 4 of Khemakhem et al. (2020), which closely resembles the linear general position in Assumption 8 of Arjovsky et al. (2019). Thus, Lemma 4 can analogously be called the nonlinear general position in our generalization theory; its proof can be found in Arjovsky et al. (2019). Secondly, when the set of training environments lies in this nonlinear general position and the other hypotheses of Theorems 1 and 2 hold, Theorem 3 guarantees that all the latent factors X can be identified up to a permutation and pointwise transformation. Since this identifiability result holds under the assumptions guaranteeing that training environments contain sufficient diversity and satisfy an underlying invariance across all the environments, it also holds across all the environments. Thirdly, Proposition 1 implies that all the direct causes Pa(Y) of Y can be fully discovered, which also holds across all the environments for the same reason. Finally, the challenging bi-level optimization problem in both IRM and IRMG can now be reduced to two simpler independent optimization problems: (i) learning the invariant data representation Φ from O to Pa(Y), and (ii) learning the invariant classifier w from Pa(Y) to Y, as described in Eq. (9) and Eq. (10). For both (i) and (ii), since there exist no spurious correlations between O and Pa(Y) or between Pa(Y) and Y, learning theory guarantees that in the limit of infinite data, we converge to the true invariant data representation Φ and the true invariant classifier w.

It is worth noting that although assumption (iii) and assumption (iv) in Theorem 1 impose seemingly complicated conditions on the diversity of training environments, they are easy to meet in practice. As we will observe in our experiments, two environments are often sufficient to recover invariances.
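The conditional independence tests underlying these rules can be instantiated in many ways. The sketch below (our own simplification, adequate only for roughly linear-Gaussian data) uses partial correlation as a proxy for conditional independence, on data where X_i is a child of Y so that X_i depends on E marginally but is independent of E given Y:

```python
import numpy as np

rng = np.random.default_rng(4)

# Partial-correlation proxy for (conditional) independence testing. For a CI test
# "a independent of b given z", regress z out of both variables, then correlate residuals.
def partial_corr(a, b, z=None):
    if z is not None:
        Z = np.column_stack([np.ones_like(z), z])
        a = a - Z @ np.linalg.lstsq(Z, a, rcond=None)[0]
        b = b - Z @ np.linalg.lstsq(Z, b, rcond=None)[0]
    return np.corrcoef(a, b)[0, 1]

n = 50000
e = rng.standard_normal(n)          # environment-like variable
y = e + rng.standard_normal(n)      # E -> Y
x = y + rng.standard_normal(n)      # Y -> X_i (X_i is a child of Y)

# X_i depends on E marginally, but becomes independent of E when conditioning on Y
assert abs(partial_corr(x, e)) > 0.1
assert abs(partial_corr(x, e, z=y)) < 0.05
```

A real implementation would turn these correlations into p-values (e.g., via a Fisher z-test) and use a nonparametric test when the linear-Gaussian assumption is doubtful.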

F DATASETS

For convenience and completeness, we provide the descriptions of Colored MNIST Digits, Colored Fashion MNIST, and Office-Home here. Please refer to the original papers (Arjovsky et al., 2019; Ahuja et al., 2020; Gulrajani & Lopez-Paz, 2020; Venkateswara et al., 2017) for more details.

F.1 SYNTHETIC DATA

For the nonlinear transformation, we use the following MLP:

• Input layer: input batch (batch size, input dimension)
• Layer 1: fully connected layer, output size = 6, activation = ReLU
• Output layer: fully connected layer, output size = 10
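The MLP above can be sketched as a plain forward pass; the weights here are random placeholders, since the paper does not specify an initialization, and the input dimension is our own choice:

```python
import numpy as np

rng = np.random.default_rng(5)

# Forward pass of the stated MLP: input -> FC(6) + ReLU -> FC(10)
def mlp_forward(x, w1, b1, w2, b2):
    h = np.maximum(0.0, x @ w1 + b1)   # Layer 1: fully connected, ReLU
    return h @ w2 + b2                 # Output layer: fully connected, size 10

input_dim, batch = 5, 8                # input_dim = 5 is an arbitrary assumption
w1, b1 = rng.standard_normal((input_dim, 6)), np.zeros(6)
w2, b2 = rng.standard_normal((6, 10)), np.zeros(10)
out = mlp_forward(rng.standard_normal((batch, input_dim)), w1, b1, w2, b2)
assert out.shape == (batch, 10)
```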

F.2 COLORED MNIST DIGITS

We use the exact same environment as in Arjovsky et al. (2019), who propose to create environments for classifying digits in the MNIST dataset, where the images are colored in such a way that the colors spuriously correlate with the labels. The task is to classify whether the digit is less than 5 or at least 5. There are three environments (two training environments containing 30,000 points each, and one test environment containing 10,000 points). We add noise to the preliminary label (ỹ = 0 if the digit is between 0-4 and ỹ = 1 if the digit is between 5-9) by flipping it with 25 percent probability to construct the final label. We sample the color id z by flipping the final label with probability p_e, where p_e is 0.2 in the first environment, 0.1 in the second environment, and 0.9 in the third environment. The third environment is the test environment. We color the digit red if z = 1 or green if z = 0.
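The construction above can be sketched as follows (our own code, not the authors'); it reproduces the stated label-noise and color-flip probabilities, leaving out the actual image coloring:

```python
import numpy as np

rng = np.random.default_rng(6)

# Label/color construction for one Colored-MNIST-style environment.
def make_environment(digits, p_e):
    y_tilde = (digits >= 5).astype(int)              # preliminary label
    flip_y = rng.random(len(digits)) < 0.25          # 25% label noise
    y = np.where(flip_y, 1 - y_tilde, y_tilde)       # final label
    flip_z = rng.random(len(digits)) < p_e           # color id flips y w.p. p_e
    z = np.where(flip_z, 1 - y, y)                   # z = 1 -> red, z = 0 -> green
    return y, z

digits = rng.integers(0, 10, size=200000)
y, z = make_environment(digits, p_e=0.2)
# In a training environment (p_e = 0.2), color agrees with the label 80% of the time
assert abs((y == z).mean() - 0.8) < 0.01

y_test, z_test = make_environment(digits, p_e=0.9)
# In the test environment (p_e = 0.9), the color correlation is reversed
assert abs((y_test == z_test).mean() - 0.1) < 0.01
```

This makes the spurious feature (color) more predictive than the stable feature (digit shape, correct 75% of the time) during training, and anti-predictive at test time.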

F.3 COLORED FASHION MNIST

We modify the Fashion MNIST dataset in a manner similar to the MNIST digits dataset. Fashion MNIST data has images from different categories: "t-shirt", "trouser", "pullover", "dress", "coat", "sandal", "shirt", "sneaker", "bag", "ankle boots". We add colors to the images in such a way that the colors correlate with the labels. The task is to classify whether the image is that of footwear or a clothing item. There are three environments (two training, one test). We add noise to the preliminary label (ỹ = 0: "t-shirt", "trouser", "pullover", "dress", "coat", "shirt" and ỹ = 1: "sandal", "sneaker", "ankle boots") by flipping it with 25 percent probability to construct the final label. We sample the color id z by flipping the noisy label with probability p_e, where p_e is 0.2 in the first environment, 0.1 in the second environment, and 0.9 in the third environment, which is the test environment. We color the object red if z = 1 or green if z = 0.

G.1 VERIFYING PHASE 1

Theorem 3 tells us that we can leverage iVAE to learn the true conditionally factorized latent variables up to a permutation and pointwise transformation. We empirically verify this point by comparing the raw data X with the corresponding X̂ inferred through the learned inference model in iVAE. Fig. 3 clearly shows that the inferred X̂ is equal to X up to a permutation and pointwise transformation. To show the importance of iVAE, we conduct an experiment in which we replace iVAE with the original VAE in Phase 1. As shown in Table 4, ICRL based on iVAE significantly outperforms the one based on VAE. It is worth noting that when the VAE is instead used in Phase 1, it usually happens in Phase 2 that either all the dimensions of X̂ or none are identified as parents of Y. This is because all components of X̂ are mixed together and influence one another even when conditioning on Y and E.

G.2 VERIFYING PHASE 2

To show how well our method can identify the direct causes of Y in Phase 2, we compare the final performance when the identified direct cause (i.e., X̂_1) and the identified non-cause (i.e., X̂_2) are respectively used in Phase 3 to learn the predictor. Note that there might exist a permutation between the inferred {X̂_1, X̂_2} and the true {X_1, X_2}. Fig. 4a and Fig. 4b show the results on the regression task, from which we can clearly see that the predictor elicited by the identified cause has a much better generalization performance. The classification result (i.e., Y is binarized) in Fig. 4c further demonstrates this point.

G.3 VERIFYING PHASE 3

In this experiment, we want to verify how well the data representation Φ can be learned by optimizing the loss in Eq. (9). The main idea is to check how well the learned Φ_{X̂_i} can purely extract the corresponding component from O = g(X), where X = (X_1, X_2). Then, we observe how X̂_i changes while tuning X_1 and X_2 respectively. Fig. 5 shows the energy plots when X̂_1 is the identified cause of Y and X̂_2 the child of Y. Note that, in the plots, Φ_{X̂_i} is denoted by f_{X̂_i}. In theory, we are able to learn an invariant data representation Φ_{X̂_1} for X_1 from O, because there is no spurious correlation between X_1 and O. By contrast, we cannot learn an invariant data representation Φ_{X̂_2} for X_2 from O, because there exist spurious correlations between X_2 and O. The results shown in Fig. 5 clearly verify our theory. Specifically, in the left plot, X̂_1 remains approximately unchanged when changing X_2 but changes when changing X_1. In the right plot, however, X̂_2 changes whether we change X_1 or X_2.

H MODEL ARCHITECTURES

In this section, we describe the architectures of different models used in different experiments. 



A brief description of variational autoencoders is given in Appendix A.
This setup for labels applies to both continuous and categorical data, where categorical data can be encoded in one-hot form.
The relation between an SEM and its graphical representation is formally defined in Appendix B.
As an initial work, we do not explicitly consider unobserved confounders in this paper. Precisely, for simplicity, we assume that there are no unobserved confounders between X_i, Y, O, and E. In fact, in some cases, our approach is not affected even if there exist unobserved confounders. For example, if there were an unobserved confounder between X_i and O, it would be absorbed into X_i and would not affect our approach.
We also provide a theorem dealing with the special case k = 1 in Appendix D.
We also tried ICP, but surprisingly ICP cannot find any parent of Y even in the identity case.
https://www.tensorflow.org/api_docs/python/tf/keras/datasets/mnist/load_data
https://www.tensorflow.org/api_docs/python/tf/keras/datasets/fashion_mnist/load_data



Figure 1: (a) Causal structure of Model 1. (b) A more practical extension of Model 1, where X1 and X2 are not directly observed and O is their observation. (c) A general version of (b), where we assume there exist multiple unobserved variables. Each of them could be a parent of Y, a child of Y, or have no connection with Y. Grey nodes denote observed variables and white nodes represent unobserved variables.

Figure 2: (a) General causal structure over {Xi, Y, O, E}, where the arrow from Xi to O is a must-have connection and the other four might not necessarily be present. (b)-(m) The 12 possible causal structures derived from (a).

PHASE 3: LEARNING AN INVARIANT PREDICTOR. Once the invariant causal representation Pa(Y) for Y across the training environments is obtained, learning an invariant predictor w ∘ Φ in Eq. (4) reduces to two simpler independent optimization problems: (i) learning the invariant data representation Φ from O to Pa(Y), and (ii) learning the invariant classifier w from Pa(Y) to Y. Mathematically, these two optimization problems can be phrased respectively as

$$\min_{\Phi \in \mathcal{H}_\Phi} \sum_{e \in \mathcal{E}_{tr}} R^e(\Phi) = \min_{\Phi \in \mathcal{H}_\Phi} \sum_{e \in \mathcal{E}_{tr}} \mathbb{E}_{O^e, \mathrm{Pa}(Y^e)}\left[\ell\big(\Phi(O^e), \mathrm{Pa}(Y^e)\big)\right],$$

$$\min_{w \in \mathcal{H}_w} \sum_{e \in \mathcal{E}_{tr}} R^e(w) = \min_{w \in \mathcal{H}_w} \sum_{e \in \mathcal{E}_{tr}} \mathbb{E}_{\mathrm{Pa}(Y^e), Y^e}\left[\ell\big(w(\mathrm{Pa}(Y^e)), Y^e\big)\right].$$
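The two-stage scheme can be sketched in the linear case (our own simplification: linear Φ and w, squared loss, environments pooled; the data-generating process below is a made-up example):

```python
import numpy as np

rng = np.random.default_rng(7)

# (i) fit a representation Phi from observations O to the discovered cause Pa(Y),
# (ii) fit the predictor w from Pa(Y) to Y -- two independent least-squares problems.
n = 20000
pa_y = rng.standard_normal(n)                      # direct cause of Y
y = 2.0 * pa_y + 0.1 * rng.standard_normal(n)      # Y depends only on its cause
# O mixes the cause with an unrelated latent via an invertible 2x2 matrix
O = np.column_stack([pa_y, rng.standard_normal(n)]) @ np.array([[1.0, 0.5],
                                                                [0.3, 1.0]])

# (i) learn Phi: O -> Pa(Y)
phi, *_ = np.linalg.lstsq(O, pa_y, rcond=None)
pa_hat = O @ phi
# (ii) learn w: Pa(Y) -> Y
w = (pa_hat @ y) / (pa_hat @ pa_hat)

mse = np.mean((w * pa_hat - y) ** 2)
assert mse < 0.05   # predictor recovers Y up to the irreducible label noise
```

Because each stage is an ordinary regression with no spurious target, neither stage needs the bi-level invariance penalty that IRM requires.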

Domain generalization emphasizes the ability to transfer acquired knowledge to domains unseen during training. A wide range of methods has been proposed for learning domain-invariant representations. Khosla et al. (2012) develop a max-margin classifier that explicitly exploits the effect of dataset bias and improves generalization to unseen domains. Fang et al. (2013) propose a metric learning approach based on structural SVM such that the neighbors of each training sample consist of examples from both the same and different domains. Muandet et al. (2013) propose a kernel-based optimization algorithm called Domain-Invariant Component Analysis (DICA), which aims both to minimize the discrepancy among domains and to preserve the relationship between input and output features. Ghifary et al. (2015) train a multi-task autoencoder that recognizes invariances among domains by learning to reconstruct analogs of the original inputs from different domains. Motiian et al. (2017) learn an embedding subspace where samples from different domains are close if they have the same class labels, and far apart if they bear different class labels. Li et al. (2018b) minimize the differences in joint distributions to achieve target-domain generalization through a conditional invariant adversarial network. Li et al. (2018a) build on adversarial autoencoders by adding maximum mean discrepancy regularization and aligning the domains' distributions.


Figure 3: Left: Comparison of the raw data X and the mean of X inferred through the learned inference model in the Nonlinear case. Right: Comparison of the raw data X and the sampled points X using the reparameterization trick in the Nonlinear case. The comparisons clearly show that the inferred X is equal to X up to a permutation and pointwise transformation.

(a) Regression results in the Linear case in terms of MSE, where the inferred X̂2 is the identified cause. (b) Regression results in the Nonlinear case in terms of MSE, where the inferred X̂2 is the identified cause. (c) Classification results on synthetic data in terms of accuracy, where the inferred X̂1 is the identified cause.

Figure 4: Comparison of the inferred X1 and X2 in terms of their final performance.

Figure 5: Left: energy plot of X̂1 = f_X1(g(X)). Right: energy plot of X̂2 = f_X2(g(X)). Note that here ΦX̂i is denoted by f_Xi.

Table 4: Results on synthetic data: comparison of iVAE and VAE used in Phase 1 in terms of MSE (mean ± std deviation). Columns: X TRANSFORMER | ALGORITHM | TRAIN MSE (σ3 = {0.2, 2}) | TEST MSE (σ3 = 5).

• Input layer: input batch (batch size, input dimension)
• Output layer: fully connected layer, output size = 1

• Layer 5: deconvolutional layer, output channels = 2, kernel size = 3, stride = 2, padding = 1, output padding = 1
• Mean output layer: activation = Sigmoid
• Variance output layer: 0.01 × 1, where 1 is a matrix of ones of size 2 × 28 × 28

Data Representation Φ
• Input layer: input batch (batch size, 2, 28, 28)
• Layer 1: convolutional layer, output channels = 32, kernel size = 3, stride = 2, padding = 1, activation = ReLU
• Layer 2: convolutional layer, output channels = 32, kernel size = 3, stride = 2, padding = 1, activation = ReLU
• Layer 3: convolutional layer, output channels = 32, kernel size = 3, stride = 2, padding = 1, activation = ReLU
• Layer 4: flatten
• Mean output layer: fully connected layer, output size = 100
• Log variance output layer: fully connected layer, output size = 100

Classifier w
• Input layer: input batch (batch size, 100)
• Layer 1: fully connected layer, output size = 100, activation = ReLU
• Output layer: fully connected layer, output size = 1, activation = Sigmoid

Φ across E_tr obtained by solving Eq. (4) remains invariant in E_all. It is worth noting that Ahuja et al. (2020) reconsidered this IRM problem from the perspective of game theory, called IRMG for short. Although in the new formulation they proved that such invariant predictors exist in E_tr when both Φ and w are relaxed to nonlinear models, their main generalization result in E_all holds only when both Φ and w are linear models (Theorem 2 in Ahuja et al. (2020)).

. . . , X_{c_k}) is the direct cause of Y, denoted by Pa(Y). As discussed in Section 3.2, there are in total 12 possible types of structures over {Xi, Y, O, E}.

Regression on synthetic data: Comparison of methods in terms of MSE (mean ± std deviation).

Colored Fashion MNIST: Comparison of methods in terms of accuracy (mean ± std deviation).

Colored MNIST: Comparison of methods in terms of accuracy (mean ± std deviation).

Invariant Prediction. Peters et al. (2015) originally introduced the theory of Invariant Causal Prediction (ICP), aiming to find the causal feature set (i.e., all direct causes of a target variable of interest).

