LEARNING CAUSAL SEMANTIC REPRESENTATION FOR OUT-OF-DISTRIBUTION PREDICTION

Abstract

Conventional supervised learning methods, especially deep ones, are found to be sensitive to out-of-distribution (OOD) examples, largely because the learned representation mixes the semantic factor with the variation factor due to their domain-specific correlation, while only the semantic factor causes the output. To address the problem, we propose a Causal Semantic Generative model (CSG) based on causality to model the two factors separately, and learn it on a single training domain for prediction without test-domain data (OOD generalization) or with unsupervised test-domain data (domain adaptation). We prove that CSG identifies the semantic factor on the training domain, and the invariance principle of causality subsequently guarantees the boundedness of the OOD generalization error and the success of adaptation. We also design novel and delicate learning methods for both effective learning and easy prediction, following the first principle of variational Bayes and the graphical structure of CSG. Empirical study demonstrates the effect of our methods in improving test accuracy for OOD generalization and domain adaptation.

1. INTRODUCTION

Deep learning has initiated a new era of artificial intelligence where the potential of machine learning models is greatly unleashed. Despite the great success, these methods heavily rely on the independently-and-identically-distributed (IID) assumption. This does not always hold in practice, and the prediction of the output (label, response, outcome) y may be saliently affected in out-of-distribution (OOD) cases, even by an essentially irrelevant change to the input (covariate) x, like a position shift or rotation of the object in an image, or a change of background, illumination or style (Shen et al., 2018; He et al., 2019; Arjovsky et al., 2019). These phenomena pose serious concerns about the robustness and trustworthiness of machine learning methods and severely impede their use in risk-sensitive scenarios. Looking into the problem, although deep learning models allow extracting abstract representations for prediction with their powerful approximation capacity, the representation may be overconfident in the correlation between semantic factors s (e.g., shape of an object) and variation factors v (e.g., background, illumination, object position). The correlation may be domain-specific and spurious, and may change drastically in a new environment. So it has become desirable to learn a representation that separates semantics s from variations v (Cai et al., 2019; Ilse et al., 2019). Formally, the importance of this goal is that s represents the cause of y. Causal relations better reflect the fundamental mechanisms of nature, bringing the merit to machine learning that they tend to be universal and invariant across domains (Schölkopf et al., 2012; Peters et al., 2017; Schölkopf, 2019), thus providing the most transferable and confident information to unseen domains.
Causality has also been shown to lead to proper domain adaptation (Schölkopf et al., 2012; Zhang et al., 2013), lower adaptation cost and lighter catastrophic forgetting (Peters et al., 2016; Bengio et al., 2019; Ke et al., 2019). In this work, we propose a Causal Semantic Generative model (CSG) for proper and robust OOD prediction, covering both OOD generalization and domain adaptation. Both tasks have supervised data from a single training domain, but domain adaptation has unsupervised test-domain data during learning, while OOD generalization has no test-domain data at all, covering cases where queries come sequentially or adaptation is unaffordable. (1) We build the model by cautiously following the principles of causality, where we explicitly separate the latent variables into a (group of) semantic factor s and a (group of) variation factor v. We prove that under appropriate conditions, CSG identifies the semantic factor by fitting training data, even in the presence of an s-v correlation. (2) By leveraging causal invariance, we prove that a well-learned CSG is guaranteed to have a bounded OOD generalization error. The bound shows how causal mechanisms affect the error. (3) We develop a domain adaptation method using CSG and causal invariance, which suggests fixing the causal generative mechanisms and adapting the prior to the new domain. We prove the identification of the new prior and the benefit of adaptation. (4) To learn and adapt the model from data, we design novel and delicate reformulations of the Evidence Lower BOund (ELBO) objective following the graphical structure of CSG, so that the inference models required therein can also serve for prediction, and modeling and optimizing inference models in both domains can be avoided. To the best of our knowledge, our work is the first to identify the semantic factor and leverage latent causal invariance for OOD prediction with guarantees.
Empirical improvement in OOD performance and adaptation is demonstrated by experiments on multiple tasks including shifted MNIST and ImageCLEF-DA task.

2. RELATED WORK

There have been works that aim to leverage the merit of causality for OOD prediction. For OOD generalization, some works ameliorate discriminative models towards a causal behavior. Bahadori et al. (2017) introduce a regularizer that reweights input dimensions based on their approximated causal effects on the output, and Shen et al. (2018) reweight training samples by amortizing causal effects among input dimensions. They are extended to nonlinear cases (Bahadori et al., 2017; He et al., 2019) via linearly separable representations. Heinze-Deml & Meinshausen (2019) enforce inference invariance by minimizing prediction variance within each label-identity group. These methods introduce no additional modeling effort, but may also be limited in capturing invariant causal mechanisms (they are non-generative) and may only behave quantitatively causal in the training domain. For domain adaptation/generalization, methods are developed under various causal assumptions (Schölkopf et al., 2012; Zhang et al., 2013) or using learned causal relations (Rojas-Carulla et al., 2018; Magliacane et al., 2018). Zhang et al. (2013); Gong et al. (2016; 2018) also consider certain ways of mechanism shift. The considered causality is among directly observed variables, which may not be suitable for general data like image pixels, where causality rather lies between data and conceptual latent factors (Lopez-Paz et al., 2017; Besserve et al., 2018; Kilbertus et al., 2018). To consider latent factors, there are domain adaptation (Pan et al., 2010; Baktashmotlagh et al., 2013; Ganin et al., 2016; Long et al., 2015; 2018) and generalization methods (Muandet et al., 2013; Shankar et al., 2018) that learn a representation with domain-invariant marginal distribution, and have achieved remarkable results. Nevertheless, Johansson et al. (2019); Zhao et al.
(2019) point out that this invariance is neither sufficient nor necessary to identify the true semantics and lower the adaptation error (Supplement D). Moreover, these methods and invariant risk minimization (Arjovsky et al., 2019) also assume invariance in the inference direction (i.e., data → representation), which may not be as general as causal invariance in the generative direction (Section 3.2). There are also generative methods for domain adaptation/generalization that model latent factors. Cai et al. (2019); Ilse et al. (2019) introduce a semantic factor and a domain-feature factor. They assume the two factors are independent in both the generative and inference models, which may not meet reality closely. They also do not adapt the prior for domain shift and thus resort to inference invariance. Zhang et al. (2020) consider a partially observed manipulation variable, while assuming its independence from the output in both the joint and the posterior, and the adaptation is inconsistent with causal invariance. Atzmon et al. (2020) consider similar latent factors, but use the same (uniform) prior in all domains. These methods also do not show guarantees to identify their latent factors. Teshima et al. (2020) leverage causal invariance and adapt the prior, while also assuming latent independence and not separating the semantic factor. They require some supervised test-domain data, and their deterministic and invertible mechanism also indicates inference invariance. In addition, most domain generalization methods require multiple training domains, with exceptions (e.g., Qiao et al., 2020) that still seek to augment domains. In contrast, CSG leverages causal invariance, and is guaranteed to identify the semantic factor from a single training domain, even with a correlation to the variation factor. Generative supervised learning is not new (Mcauliffe & Blei, 2008; Kingma et al., 2014), but most works do not consider the encoded causality.
Other works consider solving causality tasks, notably causal/treatment effect estimation (Louizos et al., 2017; Yao et al., 2018; Wang & Blei, 2019). The task does not focus on OOD prediction, and requires labels for both treated and controlled groups. Disentangling latent representations is also of interest in unsupervised learning. Despite some empirical success (Chen et al., 2016; Higgins et al., 2017; Chen et al., 2018), Locatello et al. (2019) conclude that it is impossible to guarantee disentanglement in unsupervised settings. Khemakhem et al. (2019; 2020) show an encouraging result that a disentangled representation can be identified up to a permutation when a cause of the latent variable is observed. But the methods cannot separate the semantic factor from variation for supervised learning, and require observing sufficiently many different values of the cause variable, making it hard to leverage labels. Causality with latent variables has been considered in a rich literature (Verma & Pearl, 1991; Spirtes et al., 2000; Richardson et al., 2002; Hoyer et al., 2008; Shpitser et al., 2014), while most works focus on the consequences for observation-level causality. Others consider identifying the latent variable. Janzing et al. (2009); Lee et al. (2019) show identifiability under additive noise or similar assumptions. For discrete data, a "simple" latent variable can be identified under various specifications (Janzing et al., 2011; Sgouritsa et al., 2013; Kocaoglu et al., 2018). Romeijn & Williamson (2018) leverage interventional datasets. Beyond these works, we step further to separate and identify the latent variable as semantic and variation factors, and show the benefit for OOD prediction.

3. THE CAUSAL SEMANTIC GENERATIVE MODEL

To develop the model seriously and soberly based on causality, we require the formal definition of causality: two variables have a causal relation, denoted as "cause→effect", if externally intervening on the cause (by changing variables out of the considered system) may change the effect, but not vice versa (Pearl, 2009; Peters et al., 2017). We then follow the logic below to build our model. (1) It may be a general case that neither y → x (e.g., adding noise to the labels in a dataset does not change the images) nor x → y holds (e.g., intervening on an image, say by breaking a camera sensor unit when taking the image, does not change how the photographer labels it), as also argued by Peters et al. (2017, Section 1.4); Kilbertus et al. (2018). So we employ a generative model (i.e., not only modeling p(y|x)), and introduce a latent variable z to capture factors with causal relations. (2) The latent variable z, as the underlying generating factors (e.g., object features like shape and texture, background and illumination in imaging), is plausible to cause both x (e.g., the change of object shape or background makes a different image, but breaking a camera sensor unit does not change the object shape or background) and y (e.g., the photographer would give a different label if the object shape, texture, etc. had been replaced by those of a different object, but noise-corrupting the label does not change the object features). So we orient the edges in the generative direction z → (x, y), as also adopted by Mcauliffe & Blei (2008); Peters et al. (2017); Teshima et al. (2020). This is in contrast to Cai et al. (2019); Ilse et al. (2019; 2020); Castro et al. (2020), who treat y as the cause of a semantic factor, which, when y is also a noisy observation, makes unreasonable implications (e.g., adding noise to the labels in a dataset automatically changes object features and consequently the images, and changing the object features does not change the label).
This difference is also discussed by Peters et al. (2017, Section 1.4); Kilbertus et al. (2018). (3) We attribute all x-y relations to the existence of some latent factors ("purely common cause"; Lee et al., 2019; Janzing et al., 2009), and exclude x-y edges. This can be achieved as long as z holds sufficient information of the data (e.g., with shape, background, etc. fixed, breaking a sensor unit does not change the label, and noise-corrupting the label does not change the image). Promoting this restriction reduces arbitrariness in explaining the x-y relation and benefits the identification of z. This is in contrast to Kingma et al. (2014); Zhang et al. (2020); Castro et al. (2020), who treat y as a cause of x since no latent variable is introduced in between. (4) Not all latent factors are causes of y (e.g., changing the shape may alter the label, while changing the background does not). We thus split the latent variable as z = (s, v) and remove the edge v → y, where s represents the semantic factor of x that causes y, and v describes the variation or diversity in generating x. This formalizes the intuition on the concepts in the Introduction. (5) The variation v often has a relation to the semantics s, which is often a spurious correlation (e.g., desks prefer a workspace background, but they can also appear in bedrooms, and beds can also appear in workspaces). So we keep the undirected s-v edge. Although v is not a cause of y, modeling it explicitly is worth the effort, since otherwise it would still be implicitly incorporated in s anyway through the s-v correlation. We summarize these conclusions in the following definition.

Definition 3.1 (CSG). A Causal Semantic Generative Model (CSG) p = (p(s,v), p(x|s,v), p(y|s)) is a generative model on data variables x ∈ X ⊂ R^{d_X} and y ∈ Y, with a semantic latent variable s ∈ S ⊂ R^{d_S} and a variation latent variable v ∈ V ⊂ R^{d_V}, following the graphical structure shown in Fig. 1.
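As a concrete, purely illustrative instance of this graphical structure, the ancestral-sampling sketch below draws a correlated (s, v) pair, generates x from both factors, and generates y from s alone. All functional forms and constants here are hypothetical stand-ins, not the paper's model.

```python
import math
import random

random.seed(0)

def sample_csg():
    """Ancestral sampling from a toy CSG: (s, v) -> x, and s -> y (no v -> y edge)."""
    s = random.gauss(0.0, 1.0)             # semantic factor (e.g., object shape)
    v = random.gauss(0.6 * s, 1.0)         # variation factor, spuriously correlated with s
    x = math.tanh(s) + 0.5 * v + random.gauss(0.0, 0.1)  # x caused by both s and v
    y = int(s > 0.0)                       # y caused by s only
    return s, v, x, y

# The s-v correlation makes v predictive of y in this domain even though v is
# not a cause of y -- exactly the spurious correlation a CSG models explicitly.
data = [sample_csg() for _ in range(5000)]
v_given_y1 = [v for s, v, x, y in data if y == 1]
v_given_y0 = [v for s, v, x, y in data if y == 0]
gap = sum(v_given_y1) / len(v_given_y1) - sum(v_given_y0) / len(v_given_y0)
```

In a new domain, the s-v coupling (the 0.6 coefficient here) may change or reverse, while the mechanisms generating x from (s, v) and y from s stay fixed; this is the causal invariance formalized next.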

3.1. THE CAUSAL INVARIANCE PRINCIPLE

The domain-invariance of causal relations translates to the following principle for CSG:

Principle 3.2 (causal invariance). The causal generative mechanisms p(x|s,v) and p(y|s) in a CSG are invariant across domains, and the change of the prior p(s,v) is the only source of domain shift.

It is supported by the invariance of basic laws of nature (Schölkopf et al., 2012; Peters et al., 2017; Besserve et al., 2018; Bühlmann, 2018; Schölkopf, 2019). Other works instead introduce a domain index (Cai et al., 2019; Ilse et al., 2019; 2020; Castro et al., 2020) or manipulation variables (Zhang et al., 2020; Khemakhem et al., 2019; 2020) to model distribution change explicitly. They require multiple training domains or additional observations, and such changes can also be explained under causal invariance as long as the latent variable includes all shifted factors (e.g., domain change of images can be attributed to a different preference for shape, style, texture, background, etc. and their correlations, while the processes generating the image and the label from them remain the same).

3.2. COMPARISON WITH INFERENCE INVARIANCE

Figure 2: Examples of noisy (left) or degenerate (right) generating mechanisms that lead to ambiguity in inference. Left: a handwritten digit that may be generated as either a "3" or a "5". Right: Schröder's stairs, which may be generated with either A or B being the nearer surface. Inference results notably rely on the prior on the digits/surfaces, which is domain-specific.

Domain-invariant-representation-based adaptation and generalization methods, and invariant risk minimization (Arjovsky et al., 2019) for domain generalization, use a shared feature extractor across domains. This effectively assumes the invariance of the process in the other direction, i.e., inferring the latent representation from data. We note that in its supportive examples (e.g., inferring the object position from an image, or extracting the fundamental frequency from a vocal audio), the generating mechanisms are nearly deterministic and invertible, so that the posterior is almost determined by the inverse function, and causal invariance implies inference invariance. For noisy or degenerate mechanisms (Fig. 2), ambiguity occurs during inference, since there may be multiple values of a latent feature that generate the same observation. The inferred feature then notably relies on the prior through the Bayes rule. Since the prior changes across domains, the inference rule changes by nature, which challenges the existence of a domain-shared feature extractor. In this case, causal invariance is more reliable than inference invariance. To leverage causal invariance, we adjust the prior conservatively for OOD generalization (CSG-ind) and in a data-driven way for domain adaptation (CSG-DA), so that together with the invariant generative mechanisms, it gives a different and more reliable inference rule than that following inference invariance.

4. METHOD

We develop learning, adaptation and prediction methods for OOD generalization and domain adaptation using CSG following the causal invariance Principle 3.2, and devise practical objectives using variational Bayes. Supplement E.1 details all the derivations.

4.1. METHOD FOR OOD GENERALIZATION

For OOD generalization, a CSG p = (p(s,v), p(x|s,v), p(y|s)) needs to first learn from the supervised data from an underlying data distribution p*(x,y) on the training domain. Maximizing the likelihood E_{p*(x,y)}[log p(x,y)] is intractable since p(x,y) given by the CSG p is hard to estimate effectively. We thus adopt the Evidence Lower BOund (ELBO) L_{q,p}(x,y) := E_{q(s,v|x,y)}[log (p(s,v,x,y) / q(s,v|x,y))] (Jordan et al., 1999; Wainwright et al., 2008) as a tractable surrogate, which requires an auxiliary inference model q(s,v|x,y) to estimate the expectation effectively. Maximizing L_{q,p} w.r.t. q drives q towards the posterior p(s,v|x,y) and meanwhile makes L_{q,p} a tighter lower bound of log p(x,y). The expected ELBO E_{p*(x,y)}[L_{q,p}(x,y)] then drives p(x,y) towards p*(x,y). However, the subtlety with supervised learning is that after fitting the data, evaluating p(y|x) for prediction is still hard. We thus propose to employ a model for q(s,v,y|x) instead. The required inference model can then be expressed as q(s,v|x,y) = q(s,v,y|x) / q(y|x), where q(y|x) = ∫ q(s,v,y|x) ds dv. This reformulates the expected ELBO as:

E_{p*(x,y)}[L_{q,p}(x,y)] = E_{p*(x)} E_{p*(y|x)}[log q(y|x)] + E_{p*(x)} E_{q(s,v,y|x)}[ (p*(y|x) / q(y|x)) log (p(s,v,x,y) / q(s,v,y|x)) ].   (1)

The first term is the (negative) common cross-entropy loss, driving q(y|x) towards p*(y|x). Once this is achieved, the second term becomes the expected ELBO E_{p*(x)}[L_{q(s,v,y|x),p}(x)] that drives q(s,v,y|x) towards p(s,v,y|x) (and p(x) towards p*(x)). Since the target p(s,v,y|x) admits the factorization p(s,v|x) p(y|s) (because (v,x) ⊥⊥ y | s), where p(y|s) is already given by the CSG, we can further ease the modeling of q(s,v,y|x) as q(s,v|x) p(y|s). The ELBO is then reformulated as:

L_{q,p}(x,y) = log q(y|x) + (1 / q(y|x)) E_{q(s,v|x)}[ p(y|s) log (p(s,v) p(x|s,v) / q(s,v|x)) ],   (2)

where q(y|x) = E_{q(s,v|x)}[p(y|s)]. The CSG p and q(s,v|x) are to be optimized.
The expectations can be estimated by Monte Carlo, and their gradients can be estimated using the reparameterization trick (Kingma & Welling, 2014). When well optimized, q(s,v|x) well approximates p(s,v|x), so q(y|x) then well approximates p(y|x) = E_{p(s,v|x)}[p(y|s)] for prediction.
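For intuition, the reformulated ELBO and the prediction rule q(y|x) = E_{q(s,v|x)}[p(y|s)] can be estimated by plain Monte Carlo as below. The Gaussian/Bernoulli densities and every constant are hypothetical toy choices; a real implementation would parameterize p and q with neural networks and use reparameterized gradients.

```python
import math
import random

random.seed(0)

def log_normal(z, mean, std):
    return -0.5 * ((z - mean) / std) ** 2 - math.log(std * math.sqrt(2.0 * math.pi))

# --- toy CSG p: prior p(s,v), mechanisms p(x|s,v) and p(y|s) ---
def log_prior(s, v):                       # correlated training-domain prior
    return log_normal(s, 0.0, 1.0) + log_normal(v, 0.6 * s, 1.0)

def log_p_x(x, s, v):                      # p(x|s,v)
    return log_normal(x, s + 0.5 * v, 0.3)

def p_y(y, s):                             # p(y|s) for binary y
    p1 = 1.0 / (1.0 + math.exp(-2.0 * s))
    return p1 if y == 1 else 1.0 - p1

# --- toy inference model q(s,v|x): factorized Gaussian ---
def q_sample(x):
    return 0.7 * x + 0.2 * random.gauss(0, 1), 0.3 * x + 0.3 * random.gauss(0, 1)

def log_q(s, v, x):
    return log_normal(s, 0.7 * x, 0.2) + log_normal(v, 0.3 * x, 0.3)

def elbo(x, y, n=2000):
    """Monte Carlo estimate of
    L_{q,p}(x,y) = log q(y|x) + E_q[p(y|s) log(p(s,v)p(x|s,v)/q(s,v|x))] / q(y|x)."""
    samples = [q_sample(x) for _ in range(n)]
    q_y = sum(p_y(y, s) for s, v in samples) / n    # q(y|x) = E_{q(s,v|x)}[p(y|s)]
    inner = sum(p_y(y, s) * (log_prior(s, v) + log_p_x(x, s, v) - log_q(s, v, x))
                for s, v in samples) / n
    return math.log(q_y) + inner / q_y, q_y

L, q_y1 = elbo(x=1.0, y=1)
```

Note that `q_y1` doubles as the prediction q(y=1|x): the same inference model serves both the learning objective and prediction, which is the point of the reformulation.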

CSG-ind

To actively mitigate the spurious s-v correlation from the training domain, we also consider a CSG with an independent prior p^⊥(s,v) := p(s)p(v) for prediction in the unknown test domain, where p(s) and p(v) are the marginals of p(s,v). The independent prior p^⊥(s,v) encourages the model to stay neutral on the s-v correlation. It has a larger entropy than p(s,v) (Cover & Thomas, 2006, Theorem 2.6.6), so it reduces the information of the training-domain-specific prior. The model then relies more on the invariant generative mechanisms, thus better leverages causal invariance and gives more reliable predictions than those following inference invariance.

For the method, note that the prediction is given by p^⊥(y|x) = E_{p^⊥(s,v|x)}[p(y|s)], so we use an inference model for q^⊥(s,v|x) that approximates p^⊥(s,v|x). However, learning on the training domain still requires the original inference model q(s,v|x). To save the cost of building and learning two inference models, we propose to use q^⊥(s,v|x) to represent q(s,v|x). Noting that their targets are related by p(s,v|x) = (p(s,v) / p^⊥(s,v)) (p^⊥(x) / p(x)) p^⊥(s,v|x), we formulate q(s,v|x) = (p(s,v) / p^⊥(s,v)) (p^⊥(x) / p(x)) q^⊥(s,v|x) accordingly, so that this q(s,v|x) achieves its target once q^⊥(s,v|x) does. The ELBO then becomes:

L_{q,p}(x,y) = log π(y|x) + (1 / π(y|x)) E_{q^⊥(s,v|x)}[ (p(s,v) / p^⊥(s,v)) p(y|s) log (p^⊥(s,v) p(x|s,v) / q^⊥(s,v|x)) ],   (3)

where π(y|x) := E_{q^⊥(s,v|x)}[ (p(s,v) / p^⊥(s,v)) p(y|s) ]. The CSG p and q^⊥(s,v|x) are to be optimized (note that p^⊥(s,v) is determined by p(s,v) in the CSG p). Prediction is given by p^⊥(y|x) ≈ E_{q^⊥(s,v|x)}[p(y|s)].
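The only new ingredient of CSG-ind at run time is the prior ratio p(s,v)/p^⊥(s,v): samples from the single inference model q^⊥(s,v|x) are reweighted by it inside the training objective, and used unweighted for prediction. The numeric sketch below uses a hypothetical Gaussian prior with correlation coefficient 0.6 and a hypothetical q^⊥; it is an illustration of the reweighting, not the paper's implementation.

```python
import math
import random

random.seed(0)

def log_normal(z, mean, std):
    return -0.5 * ((z - mean) / std) ** 2 - math.log(std * math.sqrt(2.0 * math.pi))

def prior_ratio(s, v):
    """w(s,v) = p(s,v) / (p(s) p(v)) for p(s,v) = N(s; 0, 1) N(v; 0.6s, 1).

    The v-marginal of this prior is p(v) = N(v; 0, sqrt(1 + 0.36))."""
    return math.exp(log_normal(v, 0.6 * s, 1.0) - log_normal(v, 0.0, math.sqrt(1.36)))

def p_y1(s):                                # p(y=1|s)
    return 1.0 / (1.0 + math.exp(-2.0 * s))

def csg_ind_terms(x, n=4000):
    # samples from the single inference model q^⊥(s,v|x)
    samples = [(0.7 * x + 0.2 * random.gauss(0, 1),
                0.3 * x + 0.3 * random.gauss(0, 1)) for _ in range(n)]
    # prediction under the independent prior: p^⊥(y=1|x) ≈ E_{q^⊥}[p(y=1|s)]
    pred = sum(p_y1(s) for s, v in samples) / n
    # training-domain quantity: π(y=1|x) ≈ E_{q^⊥}[w(s,v) p(y=1|s)]
    pi = sum(prior_ratio(s, v) * p_y1(s) for s, v in samples) / n
    return pred, pi

pred, pi = csg_ind_terms(x=1.0)
```

The design choice this illustrates: one inference model, two roles, with the correlated prior entering only through an importance weight.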

4.2. METHOD FOR DOMAIN ADAPTATION

When unsupervised data is available from an underlying data distribution p̃*(x) on the test domain, we can leverage it for adaptation. According to the causal invariance Principle 3.2, we only need to adapt the test-domain prior p̃(s,v) and the corresponding inference model q̃(s,v|x), while the causal mechanisms p(x|s,v) and p(y|s) are not optimized. Adaptation is done by fitting the test data via maximizing E_{p̃*(x)}[L_{q̃,p̃}(x)], where the ELBO is in the standard form:

L_{q̃,p̃}(x) = E_{q̃(s,v|x)}[ log (p̃(s,v) p(x|s,v) / q̃(s,v|x)) ].   (4)

Prediction is given by p̃(y|x) ≈ E_{q̃(s,v|x)}[p(y|s)]. Similar to the case of CSG-ind, we need q̃(s,v|x) for prediction, but q(s,v|x) is still required for learning on the training domain. When data from both domains are available during learning, we can save the effort of modeling and learning q(s,v|x) using a similar technique. We formulate it using q̃(s,v|x) as q(s,v|x) = (p̃(x) / p(x)) (p(s,v) / p̃(s,v)) q̃(s,v|x), following the same relation between their targets, and the ELBO on the training domain becomes:

L_{q,p}(x,y) = log π(y|x) + (1 / π(y|x)) E_{q̃(s,v|x)}[ (p(s,v) / p̃(s,v)) p(y|s) log (p̃(s,v) p(x|s,v) / q̃(s,v|x)) ],   (5)

where π(y|x) := E_{q̃(s,v|x)}[ (p(s,v) / p̃(s,v)) p(y|s) ]. The CSG p and q̃(s,v|x) are to be optimized (but not p̃(s,v), which is learned via Eq. (4)). The resulting method, termed CSG-DA, solves both optimizations (4) and (5) simultaneously.

For implementing the three methods, note that only one inference model is required in each case. Supplement E.2 shows its implementation from a general discriminative model (e.g., how to select its hidden nodes as s and v). In practice, x often has a much larger dimension than y, making the supervised part of the training-domain ELBO (i.e., the first term in its reformulation Eq. (1)) smaller in scale than the unsupervised part. We thus include an additional cross-entropy loss in the objectives.
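To see how the adaptation ELBO touches only the prior, the sketch below fits the mean of a toy test-domain prior p̃(s) by stochastic gradient ascent on a Monte Carlo estimate of that objective, with the generative mechanism and the inference model held fixed. Every density and constant is a hypothetical stand-in, and in CSG-DA proper q̃(s,v|x) is adapted as well.

```python
import random

random.seed(0)

# Unsupervised test-domain inputs, shifted relative to a training domain centered at 0.
test_xs = [random.gauss(2.0, 0.5) for _ in range(50)]

def q_sample_s(x):
    # s-component of a frozen inference model q~(s,v|x) (hypothetical form)
    return 0.8 * x + 0.1 * random.gauss(0.0, 1.0)

# Only the prior over s is adapted here: p~(s) = N(s; mean, 1). For this Gaussian,
# d/d(mean) E_{q~}[log p~(s)] = E_{q~}[s - mean]; the remaining ELBO terms
# (frozen mechanism p(x|s,v), entropy of the frozen q~) do not depend on `mean`.
mean = 0.0
for step in range(200):
    grad = 0.0
    for x in test_xs:
        for _ in range(20):
            grad += q_sample_s(x) - mean
    grad /= len(test_xs) * 20
    mean += 0.1 * grad     # gradient ascent on the adaptation ELBO

# `mean` converges toward the average inferred s on the test domain (about 1.6 with
# these toy numbers): the prior shifts to explain the new domain while the causal
# mechanisms stay invariant.
```

The same pattern carries over to the full method: the adaptation objective receives gradients only through the new prior (and q̃), never through p(x|s,v) or p(y|s).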

5. THEORY

We now establish guarantees for the methods on identifying the semantic factor and the subsequent merits for OOD generalization and domain adaptation. We only consider the infinite-data regime, to isolate out the additional source of error from finite data. Supplement A shows all the proofs. Identifiability is hard to achieve for latent variable models (Koopmans & Reiersol, 1950; Murphy, 2012; Yacoby et al., 2019; Locatello et al., 2019), since it is a task beyond modeling observational relations (Janzing et al., 2009; Peters et al., 2017). Assumptions are required to draw definite conclusions.

Assumption 5.1 (additive noise). There exist nonlinear functions f and g with bounded derivatives up to third-order, and independent random variables µ and ν, such that p(x|s,v) = p_µ(x − f(s,v)), and p(y|s) = p_ν(y − g(s)) for continuous y or p(y|s) = Cat(y|g(s)) for categorical y.

This structure disables describing a bivariate joint distribution in both generating directions (Zhang & Hyvärinen, 2009, Theorem 8; Peters et al., 2014, Proposition 23), and is widely adopted in directed causal discovery (Janzing et al., 2009; Bühlmann et al., 2014). CSG needs this since it should make the causal direction exclusive. It is also easy to implement with deep models (Kingma & Welling, 2014), so it does not essentially restrict model capacity.

Assumption 5.2 (bijectivity). Function f is bijective and g is injective.

It is a common assumption for identifiability (Janzing et al., 2009; Shalit et al., 2017; Khemakhem et al., 2019; Lee et al., 2019). Under Assumption 5.1, it is a sufficient condition (Peters et al., 2014, Proposition 17; Peters et al., 2017, Proposition 7.4) of causal minimality (Peters et al., 2014, p. 2012; Peters et al., 2017, Definition 6.33), a fundamental requirement for identifiability (Peters et al., 2014, Proposition 7; Peters et al., 2017, p. 109).
Particularly, s and v are otherwise allowed to have dummy dimensions that f and g simply ignore, raising another ambiguity against identifiability. On the other hand, according to the commonly acknowledged manifold hypothesis (Weinberger & Saul, 2006; Fefferman et al., 2016) that data tends to lie on a lower-dimensional manifold embedded in the data space, we can take X as the manifold, and such a bijection exists as a coordinate map, which is an injection into the original data space (thus allowing d_S + d_V < d_X).

5.1. IDENTIFIABILITY THEORY

We first formalize the goal of identifying the semantic factor.

Definition 5.3 (semantic-equivalence). We say two CSGs p and p′ are semantic-equivalent, if there exists a homeomorphism Φ on S × V, such that (i) its output dimensions in S are constant of v: Φ_S(s,v) = Φ_S(s) for any v ∈ V, and (ii) it acts as a reparameterization from p to p′: Φ_#[p_{s,v}] = p′_{s,v}, p(x|s,v) = p′(x|Φ(s,v)), and p(y|s) = p′(y|Φ_S(s)).

It is an equivalence relation if V is connected and is either open or closed in R^{d_V} (Supplement A.1). Here, Φ_#[p_{s,v}] denotes the pushed-forward distribution by Φ, i.e., the distribution of the transformed random variable Φ(s,v) when (s,v) ∼ p_{s,v}. As a reparameterization, Φ allows the two models to have different latent-variable parameterizations while inducing the same distribution on the observed data variables (x,y) (Supplement Lemma A.2). At the heart of the definition, the v-constancy of Φ_S implies that Φ is semantic-preserving: one model does not mix the other's v into its s, so that the s variables of both models hold equivalent information. We say that a learned CSG p identifies the semantic factor if it is semantic-equivalent to the ground-truth CSG p*. This identification cannot be characterized by the statistical independence between s and v (as in Cai et al. (2019); Ilse et al. (2019); Zhang et al. (2020)), which is neither sufficient (Locatello et al., 2019) nor necessary (due to the existence of spurious correlation). Another related concept is disentanglement. It requires that a semantic transformation on x changes the learned s only (Higgins et al., 2018; Besserve et al., 2020), while the identification here does not require the learned v to be constant of the ground-truth s. To identify the semantic factor, the ground-truth model could at most provide its information via the data distribution p*(x,y). Although semantic-equivalent CSGs induce the same distribution on (x,y), the inverse is nontrivial.
The following theorem shows that semantic-identifiability can be achieved under appropriate conditions.

Theorem 5.4 (semantic-identifiability). With Assumptions 5.1 and 5.2, a well-learned CSG p with p(x,y) = p*(x,y) is semantic-equivalent to the ground-truth CSG p*, if log p(s,v) and log p*(s,v) have bounded derivatives up to the second-order, and either (i) 1/σ_µ² → ∞, where σ_µ² := E[µ^⊤µ], or (ii) p_µ has an a.e. non-zero characteristic function (e.g., a Gaussian distribution).

Remarks. (1) The requirement on p(s,v) and p*(s,v) excludes extreme training data that show a deterministic s-v relation, which makes the (s,v) density functions unbounded and discontinuous. In that case (e.g., all desks appear in workspaces and all beds in bedrooms), one cannot tell whether the label y is caused by s (e.g., the shape) or by v (e.g., the background). (2) In condition (i), 1/σ_µ² measures the intensity of the causal mechanism p(x|s,v). A strong p(x|s,v) helps disambiguate the values of (s,v) in generating a given x. The condition makes p(x|s,v) so strong that it is almost deterministic and invertible, so inference invariance also holds (Section 3.2). Supplement A.2 provides a quantitative reference of large intensity for practical consideration, and Supplement B gives a non-asymptotic extension showing how the intensity trades off the tolerance of the equalities in Definition 5.3. Condition (ii) covers more than inference invariance. It roughly implies that different values of (s,v) a.s. produce different distributions p(x|s,v) on X, so their roles in generating x become clear, which helps identification. (3) The theorem does not contradict the impossibility result by Locatello et al. (2019), which considers disentangling each latent dimension with an unconstrained (s,v) → (x,y), while we identify s as a whole with the edge v → y removed, which breaks the s-v symmetry.

5.2. OOD GENERALIZATION THEORY

The causal invariance Principle 3.2 forms the ground-truth CSG on the test domain as p̃* = (p̃*(s,v), p*(x|s,v), p*(y|s)) with the new ground-truth prior p̃*(s,v), which gives the optimal predictor Ẽ*[y|x] on the test domain (for categorical y, the expectation of y is taken under the one-hot representation). The principle also leads to the invariance of identified causal mechanisms, which shows that the OOD generalization error of a CSG is bounded:

Theorem 5.5 (OOD generalization error). With Assumptions 5.1 and 5.2, for a semantically-identified CSG p on the training domain with reparameterization Φ, we have, up to O(σ_µ²), that for any x ∈ supp(p_x) ∩ supp(p̃*_x),

|E[y|x] − Ẽ*[y|x]| ≤ σ_µ² ‖∇g(s)‖₂ ‖J_{f⁻¹}(x)‖₂² ‖∇ log (p̃(s,v) / p(s,v))‖₂ |_{(s,v)=f⁻¹(x)},

where supp denotes the support of a distribution, J_{f⁻¹} is the Jacobian matrix of f⁻¹, and p̃_{s,v} := Φ_#[p̃*_{s,v}] is the test-domain prior under the parameterization of the identified CSG p. (Two technical notes: the definition of Φ_#[p_{s,v}] requires Φ to be measurable, which is satisfied by the continuity of Φ as a homeomorphism, as long as the considered σ-field is the Borel σ-field (Billingsley, 2012, Theorem 13.2); and to be precise, the semantic-equivalence conclusions of Theorem 5.4 hold asymptotically in the limit 1/σ_µ² → ∞ for condition (i), and hold a.e. for condition (ii).)

The result shows that when the causal mechanism p(x|s,v) is strong, especially in the extreme case σ_µ = 0 where inference invariance also holds, it dominates prediction over the prior and the generalization error diminishes. In more general cases where only causal invariance holds, the prior change deviates the prediction rule. The prior-change term ‖∇ log (p̃(s,v) / p(s,v))‖₂ measures the hardness or severity of OOD. It diminishes in IID cases, and makes the bound lose its effect when the two priors do not share their support.
Using a CSG to fit training data enforces causal invariance and the other assumptions, so its E[y|x] behaves more faithfully in low-p*(x) regions and the boundedness becomes more plausible in practice. CSG-ind further actively adopts an independent prior, whose larger support covers more candidates for p̃_{s,v}.
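As a numerical illustration of the prior-change term (a sketch of our own, under the simplifying assumption of 1-D Gaussian priors), ‖∇ log(p/p̃)‖₂ can be evaluated in closed form and behaves as the discussion describes, vanishing in the IID case and growing with the prior shift:

```python
import numpy as np

def grad_log_ratio(z, m, s2, m_t, s2_t):
    # d/dz [log N(z; m, s2) - log N(z; m_t, s2_t)]
    # using d/dz log N(z; m, s2) = -(z - m) / s2.
    return -(z - m) / s2 + (z - m_t) / s2_t

z = 0.3
# IID case: identical priors -> the prior-change term vanishes, so the
# OOD generalization bound reduces to 0 (up to O(sigma_mu^2) residuals).
assert grad_log_ratio(z, 0.0, 1.0, 0.0, 1.0) == 0.0

# A mean shift in the test-domain prior makes the term grow with the
# shift, reflecting the "hardness" of the OOD scenario.
shifts = [0.5, 1.0, 2.0]
terms = [abs(grad_log_ratio(z, 0.0, 1.0, d, 1.0)) for d in shifts]
assert terms[0] < terms[1] < terms[2]
```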

5.3. DOMAIN ADAPTATION THEORY

In cases of a weak causal mechanism or a drastic prior change, the new ground-truth prior p̃*_{s,v} is important for prediction. The domain adaptation method learns a new prior p̃_{s,v} by fitting unsupervised test-domain data, with the causal mechanisms shared. Once the mechanisms are identified, p̃*_{s,v} can also be identified under the learned parameterization, and prediction becomes precise:

Theorem 5.6 (domain adaptation error). Under the conditions of Theorem 5.4, for a semantically-identified CSG p on the training domain with reparameterization Φ, if its new prior p̃_{s,v} for the test domain is well-learned in the sense that p̃(x) = p̃*(x), then p̃_{s,v} = Φ#[p̃*_{s,v}], and Ẽ[y|x] = Ẽ*[y|x] for any x ∈ supp(p̃*_x).

Different from existing domain adaptation bounds (Supplement D), Theorems 5.5 and 5.6 allow different inference models in the two domains, thus going beyond inference invariance.
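The adaptation idea, keep the identified mechanisms fixed and refit only the prior on unlabeled test-domain x, can be sketched in a minimal linear-Gaussian toy model (a hypothetical construction of our own, not the paper's CSG-DA implementation: scalar latent z, mechanisms x = a·z + µ and y = c·z):

```python
import numpy as np

rng = np.random.default_rng(0)
a, c, sigma2 = 2.0, 1.5, 0.25   # shared mechanisms: x = a*z + mu, y = c*z

def posterior_mean_z(x, m):
    # E[z|x] for prior z ~ N(m, 1) and likelihood x | z ~ N(a*z, sigma2).
    return (m + a * x / sigma2) / (1.0 + a**2 / sigma2)

# Test domain: prior mean shifted to 3 (unknown to the learner).
m_true = 3.0
z = rng.normal(m_true, 1.0, size=20000)
x = a * z + rng.normal(0.0, np.sqrt(sigma2), size=z.size)

# Adaptation: refit only the prior from unlabeled test-domain x,
# keeping the causal mechanisms (a, c, sigma2) fixed.
# Here p(x) = N(a*m, a^2 + sigma2), so the MLE of m is mean(x)/a.
m_hat = x.mean() / a

x0 = 5.0
pred_adapted = c * posterior_mean_z(x0, m_hat)
pred_training = c * posterior_mean_z(x0, 0.0)   # stale training prior
pred_optimal = c * posterior_mean_z(x0, m_true)

# The adapted predictor approaches the test-domain optimum.
assert abs(pred_adapted - pred_optimal) < abs(pred_training - pred_optimal)
```

Matching the test-domain p(x) recovers the new prior here because the mechanism is shared and identified, mirroring the structure of Theorem 5.6.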

6. EXPERIMENTS

For baselines of OOD generalization, apart from conventional supervised learning optimizing cross entropy (CE), we also consider a causal discriminative method, CNBB (He et al., 2019), and a generative method, supervised VAE (sVAE), which is a counterpart of CSG that does not separate its latent variable into s and v. For domain adaptation, we consider the well-acknowledged DANN (Ganin et al., 2016), DAN (Long et al., 2015) and CDAN (Long et al., 2018) methods as implemented in the dalib package (Jiang et al., 2020), as well as sVAE using a method similar to CSG-DA. All methods share the same optimization setup. We align the scale of the CE term in the objectives of all methods, and tune their hyperparameters to lie on the margin that makes the final accuracy near 1 on a validation set from the training domain. See Supplement F for details.

6.1. SHIFTED MNIST

We consider an OOD prediction task on MNIST to classify digits "0" and "1". In the training data, "0"s are horizontally shifted at random by δ pixels with δ ∼ N(−5, 1²), and "1"s by δ ∼ N(5, 1²) pixels. We consider two test domains, in which the digits are either not shifted or shifted by δ ∼ N(0, 2²) pixels. Both domains have balanced classes. We implement all methods using multilayer perceptrons, which are not naturally shift-invariant. We use a larger architecture for discriminative and domain adaptation methods to compensate for the additional generative components of the generative methods. The OOD performance is shown in Table 1. For OOD generalization, CSG gives more genuine predictions in unseen domains, thanks to the identification of the semantic factor. CSG-ind performs even better, demonstrating the merit of approaching a CSG with an independent prior. Other methods are more significantly misled by the position factor from the spurious correlation. CNBB ameliorates the position bias, but not as thoroughly without explicit structures for causal mechanisms.

For the second task, the generative components adopt the architecture of Radford et al. (2015) pretrained on Cifar10. Table 2 shows the results. We see that CSG(-ind) achieves the best OOD generalization result, and performs comparably with modern domain adaptation methods. On this task, the underlying causal mechanism may be very noisy (e.g., photos taken from inside and outside both count as the aircraft class), making identification hard. So CSG-DA does not make a salient improvement.
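The shifted-MNIST data-generating protocol above can be sketched as follows (our own sketch; a placeholder random array stands in for an actual MNIST digit, and the shift wraps around cyclically as a simplification):

```python
import numpy as np

rng = np.random.default_rng(0)

def shift_digit(img, delta):
    # Horizontally shift a 28x28 image by an integer number of pixels
    # (cyclically, via np.roll, as a simplification).
    return np.roll(img, int(round(delta)), axis=1)

def make_training_example(img, label):
    # Training domain: "0"s shifted by delta ~ N(-5, 1^2), "1"s by
    # N(+5, 1^2), creating a spurious position-label correlation.
    mean = -5.0 if label == 0 else 5.0
    return shift_digit(img, rng.normal(mean, 1.0)), label

def make_test_example(img, label):
    # Test domain: delta ~ N(0, 2^2) for both classes, so position
    # carries no label information.
    return shift_digit(img, rng.normal(0.0, 2.0)), label

img = rng.random((28, 28))  # placeholder for an MNIST digit
x_tr, y_tr = make_training_example(img, 0)
assert x_tr.shape == (28, 28)
```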

7. CONCLUSION AND DISCUSSION

We tackle OOD generalization and domain adaptation tasks by proposing a Causal Semantic Generative model (CSG), which builds upon causal reasoning and models the semantic and variation factors separately while allowing their correlation. Using the invariance principle of causality, we develop effective and delicate methods for learning, adaptation and prediction, and prove the identification of the semantic factor, the boundedness of the OOD generalization error, and the success of adaptation under appropriate conditions. Experiments show improved performance in both tasks.

The consideration of separating semantics from variation extends to broader examples regarding robustness. Convolutional neural networks are found to change their predictions under a different texture but the same shape (Geirhos et al., 2019; Brendel & Bethge, 2019). Adversarial vulnerability (Szegedy et al., 2014; Goodfellow et al., 2015; Kurakin et al., 2016) extends variation factors to human-imperceptible features, i.e. the adversarial noise, which is shown to have a strong spurious correlation with semantics (Ilyas et al., 2019). The separation also matters for fairness, where a sensitive variation factor may change the prediction due to a spurious correlation. Our methods are potentially beneficial in these examples.

SUPPLEMENTARY MATERIALS

A PROOFS

We first introduce some handy concepts and results to make the proofs succinct, beginning with extended discussions on CSGs.

Definition A.1. A homeomorphism Φ on S × V is called a reparameterization from CSG p to CSG p', if Φ#[p_{s,v}] = p'_{s,v}, and p(x|s, v) = p'(x|Φ(s, v)) and p(y|s) = p'(y|Φ_S(s, v)) for any (s, v) ∈ S × V. A reparameterization Φ is called semantic-preserving if its output dimensions in S are constant of v: Φ_S(s, v) = Φ_S(s) for any v ∈ V.

Note that a reparameterization does not necessarily have its output dimensions in S, i.e. Φ_S(s, v), constant of v. The condition that p(y|s) = p'(y|Φ_S(s, v)) for any v ∈ V does not imply that Φ_S(s, v) is constant of v, since p'(y|s') may ignore the change of s' = Φ_S(s, v) induced by the change of v.

The following lemma shows the meaning of a reparameterization: it allows a CSG to vary while inducing the same distribution on the observed data variables (x, y) (i.e., holding the same effect on describing data).

Lemma A.2. If there exists a reparameterization Φ from CSG p to CSG p', then p(x, y) = p'(x, y).

Proof. By the definition of a reparameterization, we have:
p(x, y) = ∫ p(s, v) p(x|s, v) p(y|s) ds dv = ∫ Φ⁻¹#[p'_{s,v}](s, v) p'(x|Φ(s, v)) p'(y|Φ_S(s, v)) ds dv = ∫ p'_{s,v}(s', v') p'(x|s', v') p'(y|s') ds' dv' = p'(x, y),
where we used the variable substitution (s', v') := Φ(s, v) in the second-last equality. Note that by the definition of the pushed-forward distribution and the bijectivity of Φ, Φ#[p_{s,v}] = p'_{s,v} implies p_{s,v} = Φ⁻¹#[p'_{s,v}], and ∫ f(s', v') p'_{s,v}(s', v') ds' dv' = ∫ f(Φ(s, v)) Φ⁻¹#[p'_{s,v}](s, v) ds dv (this can also be verified deductively using the rule of change of variables, Lemma A.4 below).

The definition of semantic-equivalence (Definition 5.3) can be rephrased as the existence of a semantic-preserving reparameterization.
With appropriate model assumptions, we can show that any reparameterization between two CSGs is semantic-preserving, so that two CSGs cannot be converted to each other by a reparameterization that mixes s with v.

Lemma A.3. For two CSGs p and p', if p'(y|s') has a statistic M(s') that is an injective function of s', then any reparameterization Φ from p to p', if it exists, has its Φ_S constant of v.

Proof. Let Φ = (Φ_S, Φ_V) be any reparameterization from p to p'. The condition that p(y|s) = p'(y|Φ_S(s, v)) for any v ∈ V indicates that M(s) = M(Φ_S(s, v)). If there existed s ∈ S and v⁽¹⁾ ≠ v⁽²⁾ ∈ V such that Φ_S(s, v⁽¹⁾) ≠ Φ_S(s, v⁽²⁾), then M(Φ_S(s, v⁽¹⁾)) ≠ M(Φ_S(s, v⁽²⁾)) since M is injective. This violates M(s) = M(Φ_S(s, v)), which requires both M(Φ_S(s, v⁽¹⁾)) and M(Φ_S(s, v⁽²⁾)) to be equal to M(s). So Φ_S(s, v) must be constant of v.

We then introduce two mathematical facts.

Lemma A.4 (rule of change of variables). Let z be a random variable on a Euclidean space R^{d_Z} with density function p_z(z), and let Φ be a homeomorphism on R^{d_Z} whose inverse Φ⁻¹ is differentiable. Then the transformed random variable z' = Φ(z) has a density function Φ#[p_z](z') = p_z(Φ⁻¹(z')) |J_{Φ⁻¹}(z')|, where |J_{Φ⁻¹}(z')| denotes the absolute value of the determinant of the Jacobian matrix (J_{Φ⁻¹}(z'))_{ia} := ∂(Φ⁻¹)_a(z')/∂z'_i of Φ⁻¹ at z'.

Proof. See e.g., Billingsley (2012, Theorem 17.2). Note that a homeomorphism is (Borel) measurable since it is continuous (Billingsley, 2012, Theorem 13.2), so the definition of Φ#[p_z] is valid.

Lemma A.5. Let µ be a random variable whose characteristic function is a.e. non-zero. For two functions f and f' on the same space, we have f * p_µ = f' * p_µ ⟺ f = f' a.e., where (f * p_µ)(x) := ∫ f(x − µ) p_µ(µ) dµ denotes convolution.

Proof. The function equality f * p_µ = f' * p_µ leads to the equality under Fourier transformation, F[f * p_µ] = F[f' * p_µ], which gives F[f]F[p_µ] = F[f']F[p_µ]. Since F[p_µ] is the characteristic function of p_µ and is a.e. non-zero, we have F[f] = F[f'] a.e., thus f = f' a.e.

Proposition A.6. The semantic-equivalence relation between CSGs (Definition 5.3) is an equivalence relation.

Proof. Let Φ be a semantic-preserving reparameterization from one CSG p = (p(s, v), p(x|s, v), p(y|s)) to another p' = (p'(s, v), p'(x|s, v), p'(y|s)). It has its Φ_S constant of v, so we can write Φ(s, v) = (Φ_S(s), Φ_V(s, v)) =: (φ(s), ψ_s(v)).

(1) We first show that φ, and ψ_s for any s ∈ S, are homeomorphisms on S and V, respectively, and that Φ⁻¹(s', v') = (φ⁻¹(s'), ψ⁻¹_{φ⁻¹(s')}(v')).

• Since Φ(S × V) = S × V, we have φ(S) = Φ_S(S) = S, so φ is surjective.
• Suppose that there exists s' ∈ S such that φ⁻¹(s') = {s⁽ⁱ⁾}_{i∈I} contains multiple distinct elements.
  1. Since Φ is surjective, for any v' ∈ V, there exist i ∈ I and v ∈ V such that (s', v') = Φ(s⁽ⁱ⁾, v) = (φ(s⁽ⁱ⁾), ψ_{s⁽ⁱ⁾}(v)), which means that ∪_{i∈I} ψ_{s⁽ⁱ⁾}(V) = V.
  2. Since Φ is injective, the sets {ψ_{s⁽ⁱ⁾}(V)}_{i∈I} must be mutually disjoint. Otherwise, there would exist i ≠ j ∈ I and v⁽¹⁾, v⁽²⁾ ∈ V such that ψ_{s⁽ⁱ⁾}(v⁽¹⁾) = ψ_{s⁽ʲ⁾}(v⁽²⁾), thus Φ(s⁽ⁱ⁾, v⁽¹⁾) = (s', ψ_{s⁽ⁱ⁾}(v⁽¹⁾)) = (s', ψ_{s⁽ʲ⁾}(v⁽²⁾)) = Φ(s⁽ʲ⁾, v⁽²⁾), which violates the injectivity of Φ since s⁽ⁱ⁾ ≠ s⁽ʲ⁾.
  3. In the case where V is open, so is each ψ_{s⁽ⁱ⁾}(V), since Φ, as a homeomorphism, is an open map. But the union of multiple disjoint open sets ∪_{i∈I} ψ_{s⁽ⁱ⁾}(V) = V cannot be connected. This violates the condition that V is connected.
  4. A similar argument holds in the case where V is closed.
  So φ⁻¹(s') contains only one unique element for any s' ∈ S. So φ is injective.
• The above argument also shows that for any s' ∈ S, we have ∪_{i∈I} ψ_{s⁽ⁱ⁾}(V) = ψ_{φ⁻¹(s')}(V) = V. For any s ∈ S, there exists s' ∈ S such that s = φ⁻¹(s'), so we have ψ_s(V) = V. So ψ_s is surjective for any s ∈ S.
• Suppose that there exist v⁽¹⁾ ≠ v⁽²⁾ ∈ V such that ψ_s(v⁽¹⁾) = ψ_s(v⁽²⁾).
Then Φ(s, v (1) ) = (φ(s), ψ s (v (1) )) = (φ(s), ψ s (v (2) )) = Φ(s, v (2) ), which contradicts the injectivity of Φ since v (1) = v (2) . So ψ s is injective for any s ∈ S. • That Φ is continuous and Φ(s, v) = (φ(s), ψ s (v)) indicates that φ and ψ s are continuous. For any (s , v ) ∈ S × V, we have Φ(φ -1 (s ), ψ -1 φ -1 (s ) (v )) = (φ(φ -1 (s )), ψ φ -1 (s ) (ψ -1 φ -1 (s ) (v ))) = (s , v ). Applying Φ -1 to both sides gives Φ -1 (s , v ) = (φ -1 (s ), ψ -1 φ -1 (s ) (v )). • Since Φ -1 is continuous, φ -1 and ψ -1 s are also continuous. (2) We now show that the relation is an equivalence relation. It amounts to showing the following three properties. • Reflexivity. For two identical CSGs, we have p(s, v) = p (s, v), p(x|s, v) = p (x|s, v) and p(y|s) = p (y|s). So the identity map as Φ obviously satisfies all the requirements. • Symmetry. Let Φ be a semantic-preserving reparameterization from p = (p(s, v), p(x|s, v), p(y|s)) to p = (p (s, v), p (x|s, v), p (y|s)). From the above conclusion in (1), we know that (Φ -1 ) S (s , v ) = φ -1 (s ) is semantic-preserving. Also, Φ -1 is a homeomorphism on S × V since Φ is. So we only need to show that Φ -1 is a reparameterization from p to p for symmetry. 1. From the definition of pushed-forward distribution, we have Φ -1 # [p s,v ] = p s,v if Φ # [p s,v ] = p s,v . It can also be verified through the rule of change of variables (Lemma A.4) when Φ and Φ -1 are differentiable. From Φ # [p s,v ] = p s,v , we have for any (s , v ), p s,v (Φ -1 (s , v ))|J Φ -1 (s , v )| = p s,v (s , v ). Since for any (s, v) there exists (s , v ) such that (s, v) = Φ -1 (s , v ), this implies that for any (s, v), p s,v (s, v)|J Φ -1 (Φ(s, v))| = p s,v (Φ(s, v)), or p s,v (s, v) = p s,v (Φ(s, v))/|J Φ -1 (Φ(s, v))| = p s,v (Φ(s, v))|J Φ (s, v)| (inverse function theorem), which means that p s,v = Φ -1 # [p s,v ] by the rule of change of variables. 2. 
For any (s , v ), there exists (s, v) such that (s , v ) = Φ(s, v), so p (x|s , v ) = p (x|Φ(s, v)) = p(x|s, v) = p(x|Φ -1 (s , v )) , and p (y|s ) = p (y|Φ S (s)) = p(y|s) = p(y|(Φ -1 ) S (s )). So Φ -1 is a reparameterization from p to p. • Transitivity. Given a third CSG p = (p (s, v), p (x|s, v), p (y|s)) that is semanticequivalent to p , there exists a semantic-preserving reparameterization Φ from p to p . It is easy to see that (Φ • Φ) S (s, v) = Φ S (Φ S (s, v)) = Φ S (Φ S (s)) is constant of v thus semantic-preserving. As the composition of two homeomorphisms Φ and Φ on S × V, Φ • Φ is also a homeomorphism. So we only need to show that Φ • Φ is a reparameterization from p to p for transitivity. 1. From the definition of pushed-forward distribution, we have (Φ • Φ) # [p s,v ] = Φ # [Φ # [p s,v ]] = Φ # [p s,v ] = p s,v if Φ # [p s,v ] = p s,v and Φ # [p s,v ] = p s,v . It can also be verified through the rule of change of variables (Lemma A.4) when Φ -1 and Φ -1 are differentiable. For any (s , v ), we have (Φ • Φ) # [p s,v ](s , v ) = p s,v ((Φ • Φ) -1 (s , v )) J (Φ •Φ) -1 (s , v ) = p s,v (Φ -1 (Φ -1 (s , v ))) J Φ -1 (Φ -1 (s , v )) |J Φ -1 (s , v )| = Φ # [p s,v ](Φ -1 (s , v ))|J Φ -1 (s , v )| = p s,v (Φ -1 (s , v ))|J Φ -1 (s , v )| = Φ # [p s,v ](s , v ) = p s,v (s , v ). 2. For any (s, v), we have: p(x|s, v) = p (x|Φ(s, v)) = p (x|Φ (Φ(s, v))) = p (x|(Φ • Φ)(s, v)), p(y|s) = p (y|Φ S (s)) = p (y|Φ S (Φ S (s))) = p (y|(Φ • Φ) S (s)). So Φ • Φ is a reparameterization from p to p . This completes the proof for an equivalence relation. A.2 PROOF OF THE SEMANTIC-IDENTIFIABILITY THEOREM 5.4 We present a more general and detailed version of Theorem 5.4 and prove it. The theorem in the main context corresponds to conclusions (ii) and (i) below by taking the two CSGs p and p as the well-learned p and the ground-truth CSGs p * , respectively. Theorem 5.4' (semantic-identifiability). 
Consider CSGs p and p' satisfying Assumptions 5.1 and 5.2, with the bounded-derivative conditions specified as follows: for both CSGs, f⁻¹ and g are twice differentiable and f is thrice differentiable, with the mentioned derivatives bounded. Further assume that their priors have bounded densities and that their log p(s, v) have bounded derivatives up to the second order. If the two CSGs have p(x, y) = p'(x, y), then they are semantic-equivalent under any of the following conditions:⁷

(i) p_µ has an a.e. non-zero characteristic function (e.g., a Gaussian distribution);

(ii) 1/σ_µ² → ∞, where σ_µ² := E[µᵀµ];

(iii) 1/σ_µ² ≫ B_{f⁻¹}² max{ B_{log p} B_g + (1/2) B'_g + (3/2) d B_{f⁻¹} B'_f B_g, B_p B_{f⁻¹}^d (B_{log p}² + B'_{log p} + 3 d B_{f⁻¹} B'_f B_{log p} + 3 d^{3/2} B_{f⁻¹}² B'_f² + d³ B''_f B_{f⁻¹}) },

where d := d_S + d_V, and for both CSGs, the constant B_p bounds p(s, v); B_{f⁻¹}, B_g, B_{log p} bound the 2-norms⁸ of the gradient/Jacobian of the respective functions; B'_f, B'_g, B'_{log p} bound their Hessians; and B''_f bounds all the 3rd-order derivatives of f.

⁷ To be precise, the conclusions are that the equalities in Definition 5.3 hold a.e. under condition (i), hold asymptotically in the limit 1/σ_µ² → ∞ under condition (ii), and hold up to a negligible quantity under condition (iii).
⁸ As an induced operator norm for matrices (not the Frobenius norm).

Proof. Without loss of generality, we assume that µ and ν (for continuous y) have zero mean. If not, we can redefine f(s, v) := f(s, v) + E[µ] and µ := µ − E[µ] (and similarly for ν for continuous y), which neither alters the joint distribution p(s, v, x, y) nor violates any assumption. Also without loss of generality, we consider one scalar component (dimension) l of y, and abuse the symbols y and g for y_l and g_l to avoid unnecessary complication. Note that for continuous y, due to the additive noise structure y = g(s) + ν with zero-mean ν, we also have E[y|s] = g(s), the same as in the categorical-y case (under the one-hot representation).
We sometimes denote z := (s, v) for convenience. First note that for both CSGs and both continuous and categorical y, by construction g(s) is a sufficient statistics of p(y|s) (not only the expectation E[y|s]), and it is injective. So by Lemma A.3, we only need to show that there exists a reparameterization from p to p . We will show that Φ := f -1 • f is such a reparameterization. Since f and f are bijective and continuous, we have Φ -1 = f -1 • f , so Φ is bijective and Φ and Φ -1 are continuous. So Φ is a homeomorphism. Also, by construction, we have: p(x|z) = p µ (x -f (z)) = p µ (x -f (f -1 (f (z)))) = p µ (x -f (Φ(z))) = p (x|Φ(z)). So we only need to show that p(x, y) = p (x, y) indicates Φ # [p z ] = p z and p(y|s) = p (y|Φ S (s, v)), ∀v ∈ V under the conditions. Proof under condition (i). We begin with a useful reformulation of the integral t(z)p(x|z) dz for a general function t of z. We will encounter integrals in this form. By Assumption 5.1, we have p(x|z) = p µ (x -f (z)), so we consider a transformation Ψ x (z) := x -f (z) and let µ = Ψ x (z). It is invertible, Ψ -1 x (µ) = f -1 (x -µ), and J Ψ -1 x (µ) = -J f -1 (x -µ) . By these definitions and the rule of change of variables, we have: t(z)p(x|z) dz = t(z)p µ (Ψ x (z)) dz = t(Ψ -1 x (µ))p(µ) J Ψ -1 x (µ) dµ = t(f -1 (x -µ))p(µ) J f -1 (x -µ) dµ = E p(µ) [( tV )(x -µ)] (8) = (f # [t] * p µ )(x), where we have denoted functions t := t • f -1 , V := J f -1 , and abused the push-forward notation f # [t] for a general function t to formally denote (t • f -1 ) J f -1 = tV . According to the graphical structure of CSG, we have: p(x) = p(z)p(x|z) dz, E[y|x] = 1 p(x) yp(x, y) dy = 1 p(x) yp(z)p(x|z)p(y|s) dzdy = 1 p(x) p(z)p(x|z)E[y|s] dz = 1 p(x) g(s)p(z)p(x|z) dz. So from Eq. ( 9), we have: p(x) = (f # [p z ] * p µ )(x), E[y|x] = 1 p(x) (f # [gp z ] * p µ )(x). ( ) Matching the data distribution p(x, y) = p (x, y) indicates both p(x) = p (x) and E[y|x] = E [y|x]. 
Using Lemma A.5 under condition (i), this further indicates f # [p z ] = f # [p z ] a.e. and f # [gp z ] = f # [g p z ] a.e. The former gives Φ # [p z ] = p z . The latter can be reformed as ḡf # [p z ] = ḡ f # [p z ] a.e., so ḡ = ḡ a.e., where we have denoted ḡ := g • (f -1 ) S and ḡ := g • (f -1 ) S similarly. From ḡ = ḡ , we have for any v ∈ V, g(s) = g((f -1 • f ) S (s, v)) = g((f -1 ) S (f (s, v))) = ḡ(f (s, v)) = ḡ (f (s, v)) = g ((f -1 ) S (f (s, v))) = g (Φ S (s, v)). For both continuous and categorical y, g(s) uniquely determines p(y|s). So the above equality means that p(y|s) = p (y|Φ S (s, v)) for any v ∈ V. Proof under condition (ii). Applying Eq. ( 8) to Eqs. (10, 11), we have: p(x) = E p(µ) [(p z V )(x -µ)], E[y|x] = 1 p(x) E p(µ) [(ḡ pz V )(x -µ)], where we have similarly denoted pz := p z • f -1 . Under condition (ii), E[µ µ] is infinitesimal, so we can expand the expressions w.r.t µ. For p(x), we have: p(x) = E p(µ) pz V -∇(p z V ) µ + 1 2 µ ∇∇ (p z V )µ + O(E[ µ 3 2 ]) = pz V + 1 2 E p(µ) µ ∇∇ (p z V )µ + O(σ 3 µ ), where all functions are evaluated at x. For E[y|x], we first expand 1/p(x) using 1 x+ε = 1 x -ε x 2 + O(ε 2 ) to get: 1 p(x) = 1 pz V -1 2 p2 z V 2 E p(µ) µ ∇∇ (p z V )µ + O(σ 3 µ ). The second term is expanded as: ḡ pz V + 1 2 E p(µ) µ ∇∇ (ḡ pz V )µ + O(σ 3 µ ). Combining the two parts, we have: E[y|x] = ḡ + 1 2 E p(µ) µ (∇ log pz V )∇ḡ + ∇ḡ(∇ log pz V ) + ∇∇ ḡ µ + O(σ 3 µ ). ( ) This equation holds for any x ∈ supp(p x ) since the expectation is taken w.r.t the distribution p(x, y); in other words, the considered x here is any value generated by the model. 
So up to O(σ 2 µ ), |p(x) -(p z V )(x)| = 1 2 E p(µ) µ ∇∇ (p z V )µ 1 2 E p(µ) µ ∇∇ (p z V )µ 1 2 E p(µ) µ 2 ∇∇ (p z V ) 2 µ 2 = 1 2 E[µ µ] ∇∇ (p z V ) 2 = 1 2 E[µ µ]|p z V | ∇∇ log pz V + (∇ log pz V )(∇ log pz V ) 2 1 2 E[µ µ]|p z V | ∇∇ log pz V 2 + ∇ log pz V 2 2 , |E[y|x] -ḡ(x)| = 1 2 E p(µ) µ (∇ log pz V )∇ḡ + ∇ḡ(∇ log pz V ) + ∇∇ ḡ µ 1 2 E p(µ) µ (∇ log pz V )∇ḡ + ∇ḡ(∇ log pz V ) + ∇∇ ḡ µ 1 2 E p(µ) µ 2 (∇ log pz V )∇ḡ + ∇ḡ(∇ log pz V ) + ∇∇ ḡ 2 µ 2 1 2 E[µ µ] (∇ log pz V )∇ḡ 2 + ∇ḡ(∇ log pz V ) 2 + ∇∇ ḡ 2 = E[µ µ] (∇ log pz V ) ∇ḡ + 1 2 ∇∇ ḡ 2 . ( ) Given the bounding conditions in the theorem, the multiplicative factors to E[µ µ] in the last expressions are bounded by a constant. So when 1 σ 2 µ → ∞, i.e. E[µ µ] → 0, we have p(x) and E[y|x] converge uniformly to (p z V )(x) = f # [p z ](x) and ḡ(x), respectively. So p(x, y) = p (x, y) indicates f # [p z ] = f # [p z ] and ḡ = ḡ , which means Φ # [p z ] = p z and p(y|s) = p (y|Φ S (s, v)) for any v ∈ V, due to Eq. ( 13) and the explanation that follows. Proof under condition (iii). We only need to show that when 1 σ 2 µ is much larger than the given quantity, we still have p(x, y) = p (x, y) =⇒ pz V = p z V , ḡ = ḡ up to a negligible effect. This task amounts to showing that the residuals |p(x) -(p z V )(x)| and |E[y|x] -ḡ(x)| controlled by Eqs. (15, 16) are negligible. To achieve this, we need to further expand the controlling functions using derivatives of f , g and p z explicitly, and bound them by the bounding constants. In the following, we use indices a, b, c for the components of x and i, j, k for those of z. For functions of z appearing in the following (e.g., f , g, p z and their derivatives), they are evaluated at z = f -1 (x) since we are bounding functions of x. (1) Bounding |E[y|x] -ḡ(x)| E[µ µ] (∇ log pz V ) ∇ḡ + 1 2 ∇∇ ḡ 2 from Eq. ( 16). 
From the chain rule of differentiation, it is easy to show that: ∇ log pz = J f -1 ∇ log p z , ∇ḡ = J (f -1 ) S ∇g = J f -1 ∇ z g, where ∇ z g = (∇g , 0 d V ) (recall that g is a function only of s). For the term ∇ log V , we apply Jacobi's formula for the derivative of the log-determinant: ∂ a log V (x) = ∂ a log J f -1 (x) = tr J -1 f -1 (x) ∂ a J f -1 (x) = b,i J -1 f -1 (x) ib ∂ a J f -1 (x) bi = b,i J f (f -1 (x)) ib ∂ b ∂ a f -1 i (x) = i J f (∇∇ f -1 i ) ia . However, as bounding Eq. ( 17) already requires bounding J f -1 2 , directly using this expression to bound ∇ log V 2 would require to also bound J f 2 . This requirement to bound the first-order derivatives of both f and f -1 is a relatively restrictive one. To ease the requirement, we would like to express ∇ log V in terms of J f -1 . This can be achieved by expressing ∇∇ f -1 i 's in terms of ∇∇ f c 's. To do this, first consider a general invertible-matrix-valued function A(α) on a scalar α. We have 0 = ∂ α A(α) -1 A(α) = (∂ α A -1 )A + A -1 ∂ α A, so we have A -1 ∂ α A = -(∂ α A -1 )A, consequently ∂ α A = -A(∂ α A -1 )A. Using this relation (in the fourth equality below), we have: ∇∇ f -1 i ab = ∂ a ∂ b f -1 i = ∂ a J f -1 bi = ∂ a J f -1 bi = -J f -1 (∂ a J -1 f -1 )J f -1 bi = -J f -1 ∂ a J f J f -1 bi = - jc (J f ) bj ∂ a (∂ j f c ) (J f -1 ) ci = - jck (J f -1 ) bj (∂ k ∂ j f c )(∂ a f -1 k )(J f -1 ) ci = - c (J f -1 ) ci jk (J f -1 ) bj (∂ k ∂ j f c )(J f -1 ) ak = - c (J f -1 ) ci J f -1 (∇∇ f c )J f -1 ab , or in matrix form, ∇∇ f -1 i = - c (J f -1 ) ci J f -1 (∇∇ f c )J f -1 =: - c (J f -1 ) ci K c , where we have defined the matrix K c := J f -1 (∇∇ f c )J f -1 which is symmetric. Substituting with this result, we can transform Eq. 
( 18) into a desired form: ∇ log V (x) = i J f (∇∇ f -1 i ) i: = - i J f c (J f -1 ) ci J f -1 (∇∇ f c )J f -1 i: = - i c (J f -1 ) ci J f J -1 f (∇∇ f c )J f -1 i: = - ci (J f -1 ) ci (∇∇ f c )J f -1 i: = - c J f -1 (∇∇ f c )J f -1 c: = - c (K c c: ) = - c K c :c , so its norm can be bounded by: ∇ log V (x) 2 = c K c c: 2 = c (J f -1 ) c: (∇∇ f c )J f -1 2 c (J f -1 ) c: 2 ∇∇ f c 2 J f -1 2 B f B f -1 c (J f -1 ) c: 2 dB 2 f -1 B f , where we have used the following result in the last inequality: c (J f -1 ) c: 2 d 1/2 c (J f -1 ) c: 2 2 = d 1/2 J f -1 F d J f -1 2 dB f -1 . ( ) Integrating Eq. ( 17) and Eq. ( 21), we have: (∇ log pz V ) ∇ḡ = (J f -1 ∇ log p z + ∇ log V ) J f -1 ∇ z g J f -1 2 ∇ log p z 2 + ∇ log V 2 J f -1 ∇g 2 B f -1 B log p + dB 2 f -1 B f B f -1 B g = B log p + dB f -1 B f B 2 f -1 B g . ( ) For the Hessian of ḡ, direct calculus gives: ∇∇ ḡ = J (f -1 ) S (∇∇ g)J (f -1 ) S + d S i=1 (∇g) si (∇∇ f -1 si ) = J f -1 (∇ z ∇ z g)J f -1 + i (∇ z g) i (∇∇ f -1 i ). To avoid the requirement of bounding both ∇∇ f c 's and ∇∇ f -1 i 's, we substitute ∇∇ f -1 i using Eq. ( 19): ∇∇ ḡ = J f -1 (∇ z ∇ z g)J f -1 - i (∇ z g) i c (J f -1 ) ci K c = J f -1 (∇ z ∇ z g)J f -1 - c (J f -1 ) c,: (∇ z g) K c . So its norm can be bounded by: ∇∇ ḡ 2 J f -1 2 2 ∇∇ g 2 + c (J f -1 ) c: (∇ z g) K c 2 B 2 f -1 B g + c (J f -1 ) c: (∇ z g) B 2 f -1 B f B 2 f -1 B g + B f c (J f -1 ) c: 2 ∇ z g 2 B 2 f -1 B g + B f B g c (J f -1 ) c: 2 B 2 f -1 B g + dB f -1 B f B g , where we have used Eq. ( 22) in the last inequality. Assembling Eq. ( 23) and Eq. ( 24) into Eq. ( 16), we have: |E[y|x] -ḡ(x)| E[µ µ]B 2 f -1 B log p B g + 1 2 B g + 3 2 dB f -1 B f B g . ( ) So given the condition (iii), this residual can be neglected. ( ) Bounding |p(x) -(p z V )(x)| 1 2 E[µ µ]|p z V | ∇ log pz V 2 2 + ∇∇ log pz 2 + ∇∇ log V 2 from Eq. (15). To begin with, for any x, pz (x) = p z (f -1 (x)) B p , and V (x) = J f -1 (x) is the product of absolute eigenvalues of J f -1 (x). 
Since J f -1 (x) 2 is the largest absolute eigenvalue of J f -1 (x), so V (x) J f -1 (x) d 2 B d f -1 . For the first norm in the bracket of the r.h.s of Eq. ( 15), we have: ∇ log pz V 2 2 = ∇ log pz 2 2 + 2(∇ log pz ) ∇ log V + ∇ log V 2 2 ∇ log pz 2 2 + 2 ∇ log pz 2 ∇ log V 2 + ∇ log V 2 B 2 f -1 B 2 log p + 2dB 3 f -1 B f B log p + ∇ log V 2 2 , where we have utilized Eq. ( 17) and Eq. ( 21) in the last inequality. We consider bounding ∇ log V 2 2 separately. Using Eq. ( 20) (in the second equality below), we have: ∇ log V 2 2 = (∇ log V ) (∇ log V ) = c (K c :c ) d K d :d = cd K c c: K d :d cd K c c: K d :d = cd (J f -1 ) c: (∇∇ f c )J f -1 J f -1 (∇∇ f d )(J f -1 ) d: cd (J f -1 ) c: (J f -1 ) d: (∇∇ f c )J f -1 J f -1 (∇∇ f d ) 2 cd (J f -1 ) c: (J f -1 ) d: B 2 f -1 B 2 f = B 2 f -1 B 2 f cd (J f -1 J f -1 ) cd d 3/2 B 2 f -1 B 2 f J f -1 J f -1 2 d 3/2 B 4 f -1 B 2 f , where we have used the facts for general matrix A and (column) vectors α, β that α Aβ = α(Aβ) 2 = αβ A 2 αβ 2 A 2 = α β A 2 (28) in the fifth last inequality, and that cd |A cd | √ d 2 cd |A cd | 2 = d A F d 3/2 A 2 (29) in the second last inequality. Substituting Eq. ( 27) into Eq. ( 26), we have: ∇ log pz V 2 2 B 2 f -1 B 2 log p + 2dB 3 f -1 B f B log p + d 3/2 B 4 f -1 B 2 f . ( ) For the second norm in the bracket of the r.h.s of Eq. ( 15), similar to Eq. ( 24), we have: ∇∇ log pz 2 B 2 f -1 B log p + dB f -1 B f B log p . ( ) The third norm ∇∇ log V 2 in the bracket of the r.h.s of Eq. ( 15) needs some more effort. From Eq. 
( 20), we have ∂ b log V = -cij (J f -1 ) ci (∂ i ∂ j f c )(J f -1 ) bj , thus ∂ a ∂ b log V = - cij ∂ a (J f -1 ) ci (∂ i ∂ j f c )(J f -1 ) bj - cij (J f -1 ) ci (∂ i ∂ j f c )∂ a (J f -1 ) bj - cij (J f -1 ) ci ∂ a (∂ i ∂ j f c )(J f -1 ) bj = - cij (∂ a ∂ c f -1 i )(∂ i ∂ j f c )(J f -1 ) bj - cij (J f -1 ) ci (∂ i ∂ j f c )(∂ a ∂ b f -1 j ) - cijk (J f -1 ) ci (∂ a f -1 k )(∂ k ∂ i ∂ j f c )(J f -1 ) bj = cijd (J f -1 ) di K d ac (∂ i ∂ j f c )(J-1) bj + cijd (J f -1 ) ci (∂ i ∂ j f c )(J f -1 ) dj K d ab - cijk (J f -1 ) ci (∂ k ∂ i ∂ j f c )(J f -1 ) ak (J f -1 ) bj = cd K d ac K c db + cd K c cd K d ab - cijk (J f -1 ) ci (∂ k ∂ i ∂ j f c )(J f -1 ) ak (J f -1 ) bj , where we have used Eq. ( 19) in the third equality for the first two terms. In matrix form, we have: ∇∇ log V = cd K d :c K c d: + cd K c cd K d - cijk (J f -1 ) ci (∂ k ∂ i ∂ j f c )(J f -1 ) :k (J f -1 ) :j . We now bound the norms of the three terms in turn. For the first term, cd K d :c K c d: 2 cd K d :c K c d: 2 = cd K c d: K d :c = cd (J f -1 ) d: (∇∇ f c )J f -1 J f -1 (∇∇ f d )(J f -1 ) c: cd (J f -1 ) d: (J f -1 ) c: (∇∇ f c )J f -1 J f -1 (∇∇ f d ) 2 B 2 f -1 B 2 f cd (J f -1 J f -1 ) dc d 3/2 B 2 f -1 B 2 f J f -1 J f -1 2 d 3/2 B 4 f -1 B 2 f , where we have used Eq. ( 28) in the fourth last inequality and Eq. ( 29) in the second last inequality. For the second term, cd K c cd K d 2 cd |K c cd | K d 2 B 2 f -1 B f cd |K c cd | d 1/2 B 2 f -1 B f c d |K c cd | 2 = d 1/2 B 2 f -1 B f c K c c: 2 d 1/2 B 2 f -1 B f c (J f -1 ) c: 2 (∇∇ f c )J f -1 2 d 1/2 B 3 f -1 B 2 f c (J f -1 ) c: 2 d 3/2 B 4 f -1 B 2 f , where we have used Eq. ( 22) in the last inequality. 
For the third term, cijk (J f -1 ) ci (∂ k ∂ i ∂ j f c )(J f -1 ) :k (J f -1 ) :j 2 cijk (J f -1 ) ci (∂ k ∂ i ∂ j f c ) (J f -1 ) :k (J f -1 ) :j 2 B f ci (J f -1 ) ci jk (J f -1 ) :k (J f -1 ) :j 2 d 3/2 B f J f -1 2 jk (J f -1 ) :k (J f -1 ) :j d 3/2 B f B f -1 jk (J f -1 J f -1 ) kj d 3 B f B f -1 J f -1 J f -1 2 d 3 B f B 3 f -1 , where we have used Eq. ( 29) in the fourth last and second last inequalities. Finally, by assembling Eqs. (30, 31, 32, 33, 34) into Eq. ( 15), we have: |p(x) -(p z V )(x)| 1 2 E[µ µ]B p B d f -1 B 2 f -1 B 2 log p + 2dB 3 f -1 B f B log p + d 3/2 B 4 f -1 B 2 f + B 2 f -1 (B log p + dB f -1 B f B log p ) + 2d 3/2 B 4 f -1 B 2 f + d 3 B f B 3 f -1 = 1 2 E[µ µ]B p B d+2 f -1 B 2 log p + B log p + 3dB f -1 B f B log p + 3d 3/2 B 2 f -1 B 2 f + d 3 B f B f -1 . So given the condition (iii), this residual can be neglected.
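The proof above relies on elementary relations between the induced 2-norm, the Frobenius norm, and entry-wise sums, e.g. Eqs. (22) and (29). The following is a quick numerical sanity check of these inequalities (our own, not part of the proof):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 5
A = rng.normal(size=(d, d))

spec = np.linalg.norm(A, ord=2)      # induced 2-norm (largest singular value)
frob = np.linalg.norm(A, ord='fro')  # Frobenius norm

# ||A||_F <= sqrt(d) ||A||_2, since ||A||_F^2 is the sum of d squared
# singular values, each at most ||A||_2^2.
assert frob <= np.sqrt(d) * spec + 1e-12

# sum_{cd} |A_cd| <= d ||A||_F <= d^{3/2} ||A||_2 (Eq. (29)).
assert np.abs(A).sum() <= d**1.5 * spec + 1e-12

# sum_c ||A_{c:}||_2 <= sqrt(d) ||A||_F <= d ||A||_2 (as in Eq. (22)).
row_norm_sum = np.linalg.norm(A, axis=1).sum()
assert row_norm_sum <= np.sqrt(d) * frob + 1e-12
```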

A.3 PROOF OF THE OOD GENERALIZATION ERROR BOUND THEOREM 5.5

We give the following more detailed version of Theorem 5.5 and prove it. The theorem in the main context corresponds to conclusion (ii) below (i.e., Eq. (37) below recovers Eq. (6)), by taking the CSGs p', p and p̃ as the semantically-identified CSG on the training domain and the ground-truth CSGs on the training (p*) and test (p̃*) domains, respectively. Here, the semantic-identification requirement on the learned CSG is to guarantee that it is semantic-equivalent to the ground-truth CSG p* on the training domain, so that the condition in (ii) is satisfied.

Theorem 5.5' (OOD generalization error). (i) With Assumptions 5.1 and 5.2, let p = (p(s, v), p(x|s, v), p(y|s)) be a CSG, and let p̃ = (p̃(s, v), p(x|s, v), p(y|s)) be the CSG on the test domain, which shares the causal mechanisms with p. Then up to O(σ_µ²), we have for any x ∈ supp(p_x) ∩ supp(p̃_x),

|E[y|x] − Ẽ[y|x]| ≲ σ_µ² ‖∇g‖₂ ‖J_{f⁻¹}‖₂² ‖∇ log(p_{s,v}/p̃_{s,v})‖₂ evaluated at (s, v) = f⁻¹(x),  (35)

where J_{f⁻¹} is the Jacobian of f⁻¹. Further assume that the bounds B's defined in Theorem 5.4'(iii) hold. Then the error is negligible for any x ∈ supp(p_x) ∩ supp(p̃_x) if 1/σ_µ² ≫ B_{log p} B_g B_{f⁻¹}², and if additionally supp(p_x) = supp(p̃_x),

E_{p̃(x)} |E[y|x] − Ẽ[y|x]|² ≲ σ_µ⁴ B_g² B_{f⁻¹}⁴ E_{p̃_{s,v}}[2Δ log p_{s,v} − Δ log p̃_{s,v} + ‖∇ log p_{s,v}‖₂²].  (36)

(ii) Let p' be a CSG that is semantic-equivalent to the CSG p introduced in (i). Then up to O(σ_µ²), we have for any x ∈ supp(p'_x) ∩ supp(p̃_x),

|E'[y|x] − Ẽ[y|x]| ≲ σ_µ² ‖∇g'‖₂ ‖J_{f'⁻¹}‖₂² ‖∇ log(p'_{s,v}/p̃'_{s,v})‖₂ evaluated at (s, v) = f'⁻¹(x),  (37)

where p̃'_{s,v} := Φ#[p̃_{s,v}] is the prior of CSG p̃ under the parameterization of CSG p', derived as the pushed-forward distribution by the reparameterization Φ := f'⁻¹ ∘ f from p to p'.

For conclusion (i), the term E_{p̃_{s,v}}[2Δ log p_{s,v} − Δ log p̃_{s,v} + ‖∇ log p_{s,v}‖²] in the expected OOD generalization error, Eq. (36), is the score matching objective (Fisher divergence) (Hyvärinen, 2005), which measures the difference between p̃_{s,v} and p_{s,v}. For Gaussian priors p(s, v) = N(0, Σ) and p̃(s, v) = N(0, Σ̃), the term reduces to the matrix trace tr(−2Σ⁻¹ + Σ̃⁻¹ + Σ⁻¹Σ̃Σ⁻¹), which vanishes for Σ̃ = Σ. For conclusion (ii), note that since p and p' are semantic-equivalent, we have p_x = p'_x and E'[y|x] = E[y|x] (from Lemma A.2). So Eqs. (35) and (37) bound the same quantity.
Equation ( 37) expresses the bound using the structures of the CSG p . It is considered since recovering the exact CSG p from (x, y) data is impractical and we can only learn a CSG p that is semantic-equivalent to p. Proof. Following the proof A.2 of Theorem 5.4', we assume the additive noise variables µ and ν (for continuous y) have zero mean without loss of generality, and we denote z := (s, v). Proof under condition (i). Under the assumptions, we have Eq. ( 14) in the proof A.2 of Theorem 5.4' hold. Noting that the two CSGs share the same ḡ and V (since they share the same p(x|s, v) and p(y|s) thus f and g), we have for any x ∈ supp(p x ) ∩ supp(p x ), E[y|x] = ḡ + 1 2 E p(µ) µ (∇ log pz V )∇ḡ + ∇ḡ(∇ log pz V ) + ∇∇ ḡ µ + O(σ 3 µ ), Ẽ[y|x] = ḡ + 1 2 E p(µ) µ (∇ log pz V )∇ḡ + ∇ḡ(∇ log pz V ) + ∇∇ ḡ µ + O(σ 3 µ ), ( ) where we have similarly defined pz := pz • f -1 . By subtracting the two equations, we have that up to O(σ 2 µ ), E[y|x] -Ẽ[y|x] = 1 2 E p(µ) µ ∇ log(p z / pz )∇ḡ + ∇ḡ∇ log(p z / pz ) µ 1 2 E p(µ) µ ∇ log(p z / pz )∇ḡ + ∇ḡ∇ log(p z / pz ) µ 1 2 E p(µ) µ 2 2 ∇ log(p z / pz )∇ḡ 2 + ∇ḡ∇ log(p z / pz ) 2 = ∇ḡ ∇ log(p z / pz ) E[µ µ]. ( ) The multiplicative factor to E[µ µ] on the right hand side can be further bounded by: ∇ḡ ∇ log(p z / pz ) = (J (f -1 ) S ∇g) (J f -1 ∇ log(p z /p z )) = ∇g J (f -1 ) S J f -1 ∇ log(p z /p z ) = ((∇g) , 0 d V )J f -1 J f -1 ∇ log(p z /p z ) ∇g 2 J f -1 2 2 ∇ log(p z /p z ) 2 , ( ) where ∇g and ∇ log(p z /p z ) are evaluated at z = f -1 (x). This gives: 35) in conclusion (i). When the bounds B's in Theorem 5.4'(iii) hold, we further have: E[y|x] -Ẽ[y|x] σ 2 µ ∇g 2 J f -1 2 2 ∇ log(p z /p z ) 2 , i.e. Eq. ( E[y|x] -Ẽ[y|x] σ 2 µ ∇g 2 J f -1 2 2 ∇ log p z -∇ log pz 2 σ 2 µ ∇g 2 J f -1 2 2 ( ∇ log p z 2 + ∇ log pz 2 ) 2σ 2 µ B g B 2 f -1 B log p . So when 1 σ 2 µ B log p B g B 2 f -1 , this difference is negligible for any x ∈ supp(p x ) ∩ supp(p x ). We now turn to the expected OOD generalization error Eq. 
(36) in conclusion (i). When supp(p_x) = supp(p̃_x), Eq. (35) holds on p̃_x. Together with the bounds in Theorem 5.4'(iii), we have:

E_{p̃(x)}[(E[y|x] − Ẽ[y|x])²] ≲ σ_μ⁴ B_g² B⁴_{f⁻¹} E_{p̃(x)}[‖∇log(p_z/p̃_z)|_{z=f⁻¹(x)}‖₂²] = σ_μ⁴ B_g² B⁴_{f⁻¹} E_{p̃_z}[‖∇log(p_z/p̃_z)‖₂²],

where the equality holds due to the generating process of the model. Note that the term E_{p̃_z}[‖∇log(p_z/p̃_z)‖₂²] therein is the score matching objective (Fisher divergence). By Hyvärinen (2005, Theorem 1), we can reformulate it as E_{p̃_z}[2Δ log p_z − Δ log p̃_z + ‖∇ log p_z‖₂²], so we have:

E_{p̃(x)}[(E[y|x] − Ẽ[y|x])²] ≲ σ_μ⁴ B_g² B⁴_{f⁻¹} E_{p̃_z}[2Δ log p_z − Δ log p̃_z + ‖∇ log p_z‖₂²].

Proof under condition (ii). From Eq. (14) in the proof A.2 of Theorem 5.4', we have for CSG p′ that for any x ∈ supp(p′_x), or equivalently x ∈ supp(p_x),

E′[y|x] = ḡ′ + ½ E_{p(μ)}[μ⊤(∇log(p̄′_z V′) ∇⊤ḡ′ + ∇ḡ′ ∇⊤log(p̄′_z V′) + ∇∇⊤ḡ′) μ] + O(σ_μ³),   (41)

where we have similarly defined p̄′_z := p′_z ∘ f′⁻¹ and ḡ′ := g′ ∘ (f′⁻¹)_S. Since p and p′ are semantic-equivalent with reparameterization Φ from p to p′, we have p(y|s) = p′(y|Φ_S(s, v)), thus g(s) = g′(Φ_S(s, v)) for any v ∈ V. So for any x ∈ supp(p_x), or equivalently x ∈ supp(p′_x), we have g((f⁻¹)_S(x)) = g′(Φ_S((f⁻¹)_S(x), (f⁻¹)_V(x))) = g′(Φ_S(f⁻¹(x))) = g′((f′⁻¹)_S(f(f⁻¹(x)))) = g′((f′⁻¹)_S(x)), i.e., ḡ = ḡ′. For another fact, since p̃′_z := Φ_#[p̃_z] = (f′⁻¹ ∘ f)_#[p̃_z] by definition, we have f′_#[p̃′_z] = f_#[p̃_z], i.e., p̄̃′_z V′ = p̄̃_z V. Subtracting Eqs. (41) and (38) and applying these two facts, we have up to O(σ_μ²), for any x ∈ supp(p_x) ∩ supp(p̃_x),

|E′[y|x] − Ẽ[y|x]| = ½ |E_{p(μ)}[μ⊤(∇log(p̄′_z/p̄̃′_z) ∇⊤ḡ′ + ∇ḡ′ ∇⊤log(p̄′_z/p̄̃′_z)) μ]| ≤ ‖∇ḡ′‖₂ ‖∇log(p̄′_z/p̄̃′_z)‖₂ E[μ⊤μ],

where the inequality follows Eq. (39). Using a similar result to Eq. (40), we have:

|E′[y|x] − Ẽ[y|x]| ≲ σ_μ² ‖∇g′‖₂ ‖J_{f′⁻¹}‖₂² ‖∇log(p′_z/p̃′_z)‖₂,

where ∇g′ and ∇log(p′_z/p̃′_z) are evaluated at z = f′⁻¹(x). This gives Eq. (37).

A.4 PROOF OF THE DOMAIN ADAPTATION ERROR THEOREM 5.6

To be consistent with the notation in the proofs, we prove the theorem by denoting the semantic-identified CSG p′ and the ground-truth CSG p̃* on the test domain as p and p̃, respectively.

Proof. The new prior p̃′(z) is learned by fitting unsupervised data from the test domain p̃(x). Applying the deduction in the proof A.2 of Theorem 5.4' to the test domain, we have that under any of the three conditions in Theorem 5.4', p̃(x) = p̃′(x) indicates f̃_#[p̃_z] = f_#[p̃′_z]. This gives p̃′_z = (f⁻¹ ∘ f̃)_#[p̃_z] = Φ_#[p̃_z]. From Eq. (12) in the same proof, we have that:

p̃(x) Ẽ[y|x] = (f̃_#[g̃ p̃_z] * p_μ)(x) = ((f̃_#[p̃_z] ḡ̃) * p_μ)(x),
p̃′(x) Ẽ′[y|x] = (f_#[g p̃′_z] * p_μ)(x) = ((f_#[p̃′_z] ḡ) * p_μ)(x).

From the proof A.3 of Theorem 5.5'(ii) (the paragraph under Eq. (41)), the semantic-equivalence between CSGs p and p̃ indicates that ḡ = ḡ̃. So from the above two equations, we have p̃(x) Ẽ[y|x] = p̃′(x) Ẽ′[y|x] (recall that p̃(x) = p̃′(x) indicates f̃_#[p̃_z] = f_#[p̃′_z]). Since p̃(x) = p̃′(x) (that is how p̃′_z is learned), we have for any x ∈ supp(p̃_x), or equivalently x ∈ supp(p̃′_x), Ẽ′[y|x] = Ẽ[y|x].

B ALTERNATIVE IDENTIFIABILITY THEORY FOR CSG

The presented identifiability theory, particularly Theorem 5.4, shows that semantic-identifiability can be achieved in the deterministic limit (1/σ_μ² → ∞), but it does not quantitatively describe the extent to which identifiability is violated for a finite variance σ_μ². Here we define a "soft" version of semantic-equivalence and show that it can be achieved with a finite variance, with a trade-off between the "softness" and the variance.

Definition B.1 (δ-semantic-dependency). For δ > 0 and two CSGs p and p′, we say that they are δ-semantic-dependent if there exists a homeomorphism Φ on S × V such that: (i) p(x|s, v) = p′(x|Φ(s, v)); (ii) sup_{v∈V} ‖g(s) − g′(Φ_S(s, v))‖₂ ≤ δ, where we have denoted g(s) := E[y|s]; and (iii) sup_{v⁽¹⁾,v⁽²⁾∈V} ‖Φ_S(s, v⁽¹⁾) − Φ_S(s, v⁽²⁾)‖₂ ≤ δ.
In the definition, we have dropped the prior conversion requirement, and relaxed the exact likelihood conversion for p(y|s) in (ii) and the v-constancy of Φ_S in (iii) to allow an error bounded by δ. When δ = 0, the v-constancy of Φ_S is exact, and under the additive noise Assumption 5.1 we also have the exact likelihood conversion p(y|s) = p′(y|Φ_S(s, v)) for any v ∈ V. So 0-semantic-dependency together with the prior conversion requirement reduces to semantic-equivalence. Due to its quantitative nature, the binary relation cannot be made an equivalence relation, but only a dependency. Here, a dependency refers to a binary relation with reflexivity and symmetry, but no transitivity.

Proposition B.2. The δ-semantic-dependency is a dependency relation if the function g := E[y|s] is bijective and its inverse g⁻¹ is ½-Lipschitz.

Proof. Showing a dependency relation amounts to showing the following two properties.

• Reflexivity. For two identical CSGs p and p′, we have p(x|s, v) = p′(x|s, v) and p(y|s) = p′(y|s). So the identity map as Φ obviously satisfies all the requirements in Definition B.1.

• Symmetry. Let CSG p be δ-semantic-dependent on CSG p′ with homeomorphism Φ. Obviously Φ⁻¹ is also a homeomorphism. For any (s′, v′) ∈ S × V, we have p′(x|s′, v′) = p′(x|Φ(Φ⁻¹(s′, v′))) = p(x|Φ⁻¹(s′, v′)), and ‖g′(s′) − g((Φ⁻¹)_S(s′, v′))‖₂ = ‖g′(Φ_S(s, v)) − g(s)‖₂ ≤ δ, where we have denoted (s, v) := Φ⁻¹(s′, v′). So Φ⁻¹ satisfies requirements (i) and (ii) in Definition B.1. For requirement (iii), we need the following fact: for any s⁽¹⁾, s⁽²⁾ ∈ S,

‖s⁽¹⁾ − s⁽²⁾‖₂ = ‖g⁻¹(g(s⁽¹⁾)) − g⁻¹(g(s⁽²⁾))‖₂ ≤ ½ ‖g(s⁽¹⁾) − g(s⁽²⁾)‖₂,

where the inequality holds since g⁻¹ is ½-Lipschitz.
Then for any s′ ∈ S, we have:

sup_{v⁽¹⁾,v⁽²⁾∈V} ‖(Φ⁻¹)_S(s′, v⁽¹⁾) − (Φ⁻¹)_S(s′, v⁽²⁾)‖₂
 ≤ sup_{v⁽¹⁾,v⁽²⁾∈V} ½ ‖g((Φ⁻¹)_S(s′, v⁽¹⁾)) − g((Φ⁻¹)_S(s′, v⁽²⁾))‖₂
 = sup_{v⁽¹⁾,v⁽²⁾∈V} ½ ‖(g((Φ⁻¹)_S(s′, v⁽¹⁾)) − g′(s′)) − (g((Φ⁻¹)_S(s′, v⁽²⁾)) − g′(s′))‖₂
 ≤ sup_{v⁽¹⁾,v⁽²⁾∈V} ½ (‖g((Φ⁻¹)_S(s′, v⁽¹⁾)) − g′(s′)‖₂ + ‖g((Φ⁻¹)_S(s′, v⁽²⁾)) − g′(s′)‖₂)
 = ½ (sup_{v⁽¹⁾∈V} ‖g((Φ⁻¹)_S(s′, v⁽¹⁾)) − g′(s′)‖₂ + sup_{v⁽²⁾∈V} ‖g((Φ⁻¹)_S(s′, v⁽²⁾)) − g′(s′)‖₂) ≤ δ,

where in the last inequality we have used the fact that Φ⁻¹ satisfies requirement (ii). So p′ is δ-semantic-dependent on p via the homeomorphism Φ⁻¹.

The corresponding δ-semantic-identifiability result follows.

Theorem B.3 (δ-semantic-identifiability). Assume the same as Theorem 5.4' and Proposition B.2, and let the bounds B's defined in Theorem 5.4'(iii) hold. For two such CSGs p and p′, if they have p(x, y) = p′(x, y), then they are δ-semantic-dependent for any δ ≥ σ_μ² B²_{f⁻¹}(2B_{log p} B_g + B_g + 3d B_{f⁻¹} B_f B_g), where d := d_S + d_V.

Proof. Let Φ := f′⁻¹ ∘ f, where f and f′ are given by the two CSGs p and p′ via Assumption 5.1. We now show that p and p′ are δ-semantic-dependent via this Φ for any such δ. Obviously Φ is a homeomorphism on S × V, and it satisfies requirement (i) in Definition B.1 by construction, due to Eq. (7) in the proof A.2 of Theorem 5.4'.

Consider requirement (ii) in Definition B.1. Based on the same assumptions as Theorem 5.4', Eq. (25) holds for both CSGs:

max{‖E[y|x] − ḡ(x)‖₂, ‖E′[y|x] − ḡ′(x)‖₂} ≤ σ_μ² B²_{f⁻¹}(B_{log p} B_g + ½ B_g + (3/2) d B_{f⁻¹} B_f B_g),

where we have denoted σ_μ² := E[μ⊤μ]. Since both CSGs induce the same p(y|x), we have E[y|x] = E′[y|x]. This gives:

‖ḡ(x) − ḡ′(x)‖₂ = ‖(E′[y|x] − ḡ′(x)) − (E[y|x] − ḡ(x))‖₂ ≤ ‖E′[y|x] − ḡ′(x)‖₂ + ‖E[y|x] − ḡ(x)‖₂ ≤ σ_μ² B²_{f⁻¹}(2B_{log p} B_g + B_g + 3d B_{f⁻¹} B_f B_g).
So for any (s, v) ∈ S × V, by denoting x := f(s, v), we have:

‖g(s) − g′(Φ_S(s, v))‖₂ = ‖g((f⁻¹)_S(x)) − g′((f′⁻¹)_S(f(s, v)))‖₂ = ‖ḡ(x) − ḡ′(x)‖₂ ≤ σ_μ² B²_{f⁻¹}(2B_{log p} B_g + B_g + 3d B_{f⁻¹} B_f B_g).

So the requirement is satisfied. For requirement (iii), note from the proof of Proposition B.2 that when g is bijective and its inverse is ½-Lipschitz, requirement (ii) implies requirement (iii). So this Φ is a homeomorphism that makes p δ-semantic-dependent on p′ for any δ ≥ σ_μ² B²_{f⁻¹}(2B_{log p} B_g + B_g + 3d B_{f⁻¹} B_f B_g).

Note that although the δ-semantic-dependency does not have transitivity, the above theorem is still informative: for any two CSGs sharing the same data distribution, particularly for a well-learned CSG p′ and the ground-truth CSG p*, the likelihood conversion error sup_{(s,v)∈S×V} ‖g(s) − g′(Φ_S(s, v))‖₂, and the degree of mixing v into s, measured by sup_{v⁽¹⁾,v⁽²⁾∈V} ‖Φ_S(s, v⁽¹⁾) − Φ_S(s, v⁽²⁾)‖₂, are both bounded by σ_μ² B²_{f⁻¹}(2B_{log p} B_g + B_g + 3d B_{f⁻¹} B_f B_g).

C MORE EXPLANATIONS ON THE MODEL

Explanations on our perspective. We see the data generation process as first generating the conceptual latent factors (s, v), and then generating both x and y based on these factors. This follows Peters et al. (2017, Section 1.4), who describe the generation of an OCR dataset as the writer first coming up with the intention to write a character, and then writing down the character and giving its label based on that intention. It is also natural for medical image datasets, where the label may be diagnosed based on more fundamental features (e.g., PCR test results showing the pathogen) that are not included in the dataset but actually cause the medical image. This generation process is also considered by Mcauliffe & Blei (2008); Kilbertus et al. (2018); Teshima et al. (2020). As for the labeling process from images that one would commonly think of, we also view it as an s → y process. Humans directly perceive the critical semantic feature s (e.g., the shape and position of each stroke) when seeing the image, through the natural gift of the vision system (Biederman, 1987). The label is then given by processing the feature (e.g., the angle between two linear strokes, the position of a circular stroke relative to a linear stroke), which is an s → y process. The causal graph in Fig. 1 implies that x ⊥⊥ y|s. This does not mean that the semantic factor s generates an image x regardless of the label y: given s, the generated image is dictated to hold the given semantics regardless of randomness, so the statistical independence does not mean semantic irrelevance. If an image x is given, the corresponding label is given by p(y|x), which is ∫ p(s|x) p(y|s) ds by the causal graph. So the semantic concept that causes the label through p(y|s) is inferred from the image through p(s|x).

Comparison with the graph y_tx → s → x → y_rx. This graph is considered by one of our reviewers, under the perspective of a communication channel, where y_tx is a transmitted signal and y_rx is the received one.
If the observed label y is treated as y_tx, the graph then implies y → s. As argued at the end of item (2) in Section 3, this may lead to unreasonable implications. Moreover, the graph also implies that y is a cause of x, as is challenged in item (1) in Section 3. The unnatural implications arise since intervening on y is different from intervening on the "ground-truth" label. We consider y as an observation that may be noisy, while the "ground-truth" label is never observed: one cannot tell whether the labels at hand are noise-corrupted based on the dataset alone. For example, the label of either image in Fig. 2 may be given by a labeler's random guess. Our adopted causal direction s → y is consistent with these examples and is also argued for and adopted by Mcauliffe & Blei (2008); Peters et al. (2017, Section 1.4); Kilbertus et al. (2018); Teshima et al. (2020). If the observed label y is treated as y_rx, the graph then implies x → y, as is challenged in item (1) in Section 3 and is also argued against by Schölkopf et al. (2012); Peters et al. (2017, Section 1.4); Kilbertus et al. (2018). Treating the observed label y as y_rx and y_tx as the "ground-truth" label may be the motivation of this graph. But the graph implies that y_tx ⊥⊥ y_rx | x, that is, p(y_tx|x, y_rx) = p(y_tx|x) and p(y_rx|x, y_tx) = p(y_rx|x). So modeling y_tx (resp. y_rx) does not benefit predicting y_rx (resp. y_tx) from x.
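The implication y_tx ⊥⊥ y_rx | x of the chain graph can be verified on a toy discrete model. The sketch below (numpy; all conditional probability tables are random illustrative stand-ins) builds the joint of y_tx → s → x → y_rx and checks that p(y_tx, y_rx | x) factorizes into p(y_tx|x) p(y_rx|x) for every x:

```python
import numpy as np

rng = np.random.default_rng(0)
def norm(a, axis): return a / a.sum(axis=axis, keepdims=True)

# hypothetical discrete CPTs for the chain y_tx -> s -> x -> y_rx
p_ytx = norm(rng.random(2), 0)        # p(y_tx)
p_s   = norm(rng.random((2, 3)), 1)   # p(s | y_tx)
p_x   = norm(rng.random((3, 4)), 1)   # p(x | s)
p_yrx = norm(rng.random((4, 2)), 1)   # p(y_rx | x)

# joint p(y_tx, s, x, y_rx), then marginalize out s
joint = (p_ytx[:, None, None, None] * p_s[:, :, None, None]
         * p_x[None, :, :, None] * p_yrx[None, None, :, :])
j = joint.sum(axis=1)                 # p(y_tx, x, y_rx)

for x in range(4):
    pj = j[:, x, :] / j[:, x, :].sum()       # p(y_tx, y_rx | x)
    outer = np.outer(pj.sum(1), pj.sum(0))   # p(y_tx|x) p(y_rx|x)
    assert np.allclose(pj, outer)            # y_tx independent of y_rx given x
```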

D RELATION TO EXISTING DOMAIN ADAPTATION THEORY

Existing DA theory. In the existing DA literature, the objective is to find a labeling function h : X → Y within a hypothesis space H that minimizes the target-domain risk R̃(h) := E_{p̃(x,y)}[ℓ(h(x), y)] defined with a loss function ℓ : Y × Y → R. Since p̃(x, y) is unavailable, it is of practical interest to consider the source-domain risk R(h) and investigate its relation to R̃(h). Ben-David et al. (2010a) give a bound relating the two risks:

R̃(h) ≤ R(h) + 2d₁(p_x, p̃_x) + min{E_{p(x)}[|h*(x) − h̃*(x)|], E_{p̃(x)}[|h*(x) − h̃*(x)|]},   (43)

where: d₁(p_x, p̃_x) := sup_{X∈𝒳} |p_x[X] − p̃_x[X]|. Here 𝒳 denotes the σ-algebra on X, d₁(p_x, p̃_x) is the total variation between the two distributions, and h* ∈ argmin_{h∈H} R(h) and h̃* ∈ argmin_{h∈H} R̃(h) are the oracle/ground-truth labeling functions on the source and target domains, respectively (e.g., h*(x) = E[y|x] and h̃*(x) = Ẽ[y|x] if supp(p_x) = supp(p̃_x)). Zhao et al. (2019) give a similar bound in the case of binary classification, in terms of the H̄-divergence d_{H̄} in place of the total variation d₁, where H̄ := {sign(|h(x) − h′(x)| − t) : h, h′ ∈ H, t ∈ [0, 1]}. Ben-David et al. (2010a) also argue that in this bound, the total variation d₁ is overly strict and hard to estimate, so they develop another bound which is better known (asymptotically; omitting estimation error from finite samples):

R̃(h) ≤ R(h) + d_{H∆H}(p_x, p̃_x) + λ_H,   (44)

where: d_{H∆H}(p_x, p̃_x) := sup_{h,h′∈H} |E_{p(x)}[ℓ(h(x), h′(x))] − E_{p̃(x)}[ℓ(h(x), h′(x))]|, and λ_H := inf_{h∈H} (R(h) + R̃(h)). Here d_{H∆H}(p_x, p̃_x) is the H∆H-divergence measuring the difference between p(x) and p̃(x) under the discriminative efficacy of the labeling function family H, and λ_H is the ideal joint risk achieved by H. Long et al. (2015) give a similar bound in terms of the maximum mean discrepancy (MMD) d_K in place of d_{H∆H}. For successful adaptation, DA often makes the covariate shift assumption: h* = h̃* (or p(y|x) = p̃(y|x)) on supp(p_x, p̃_x) := supp(p_x) ∪ supp(p̃_x).
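For intuition on the d₁ term, the sketch below (numpy; the two distributions are illustrative) checks the standard identity that, for discrete distributions, the supremum over events X equals half the L1 distance between the probability vectors:

```python
import itertools
import numpy as np

p = np.array([0.5, 0.3, 0.2])   # source p_x (illustrative)
q = np.array([0.2, 0.3, 0.5])   # target p~_x (illustrative)

# enumerate all events X (subsets of the outcome space) for the supremum
events = itertools.chain.from_iterable(
    itertools.combinations(range(3), r) for r in range(4))
d1_sup = max(abs(p[list(S)].sum() - q[list(S)].sum()) for S in events)

assert np.isclose(d1_sup, 0.5 * np.abs(p - q).sum())  # = 0.3 here
```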
DA-DIR. DA based on learning domain-invariant representations (DA-DIR) (Pan et al., 2010; Baktashmotlagh et al., 2013; Long et al., 2015; Ganin et al., 2016) aims to learn a deterministic representation extractor η : X → S to some representation space S, in order to achieve a domain-invariant representation (DIR): p(s) = p̃(s), where p(s) := η_#[p_x](s) and p̃(s) := η_#[p̃_x](s) are the representation distributions on the two domains. The motivation is that, if DIR is achieved, then the distribution difference term in bound Eq. (43) or Eq. (44) diminishes, thus the bound is hopefully tighter on the representation space S than on the original data space X, so minimizing the source risk is more effective for minimizing the target risk. Let g : S → Y be a labeling function on the representation space. The end-to-end labeling function is effectively h = g ∘ η. The typical objective for DA-DIR thus combines the two desiderata:

min_{η∈E, g∈G} R(g ∘ η) + d(η_#[p_x], η_#[p̃_x]),

where d(·, ·) is a metric or discrepancy (d(q, p) = 0 ⟺ q = p) on distributions, and E and G are the hypothesis spaces for η and g, respectively. For the existence of a solution of this problem, it is often more strongly assumed that there exist η* ∈ E and g* ∈ G such that η*_#[p_x] = η*_#[p̃_x] and R̃(g* ∘ η*) = R̃(h̃*).

The examples do not contradict existing DA bounds. Consider a given representation extractor η.

(1) Under Eq. (43). Applying the bound on the representation space S gives:

R̃(g ∘ η) ≤ R(g ∘ η) + 2d₁(η_#[p_x], η_#[p̃_x]) + min{E_{η_#[p_x](s)}[|g*_η(s) − g̃*_η(s)|], E_{η_#[p̃_x](s)}[|g*_η(s) − g̃*_η(s)|]},

where g*_η and g̃*_η are the optimal labeling functions on top of the representation extractor η. In the covariate shift case, DIR η_#[p_x] = η_#[p̃_x] and minimal source risk R(g*_η ∘ η) = R(h*) are not sufficient to guarantee g*_η = g̃*_η (Ben-David et al., 2010b; Gong et al., 2016). Johansson et al. (2019) argue that they are still not sufficient even under their Assumption 3. In both examples by Johansson et al. (2019) and Zhao et al. (2019), the considered η, although achieving both desiderata, is not η*, and this η even renders different optimal g's: g*_η ≠ g̃*_η. Johansson et al. (2019) claim that it is necessary to require η to be invertible to make g*_η = g̃*_η, and develop a bound that explicitly shows the effect of the invertibility of η. The η in the examples is not invertible.

(2) Under Eq. (44). Applying the bound on the representation space S gives:

E_{p̃(s,y)}[ℓ(g(s), y)] ≤ E_{p(s,y)}[ℓ(g(s), y)] + d_{G∆G}(η_#[p_x], η_#[p̃_x]) + inf_{g∈G} (E_{p(s,y)}[ℓ(g(s), y)] + E_{p̃(s,y)}[ℓ(g(s), y)]),

where p_{s,y} := (η, id_y)_#[p_{x,y}] with id_y : (x, y) ↦ y, and similarly for p̃_{s,y}. Note that E_{p(s,y)}[ℓ(g(s), y)] = E_{p(x,y)}[ℓ(g(η(x)), y)] = R(g ∘ η), so the last term on the r.h.s. becomes: inf_{g∈G} (R(g ∘ η) + R̃(g ∘ η)) = λ_{G∘η}, where G ∘ η := {g ∘ η : g ∈ G}. So the bound becomes:

R̃(g ∘ η) ≤ R(g ∘ η) + d_{G∆G}(η_#[p_x], η_#[p̃_x]) + λ_{G∘η}.   (45)

This result is shown by Johansson et al. (2019). They argue that finding η that achieves DIR and minimal training risk cannot guarantee a tighter bound, since the last term λ_{G∘η} may be very large: nothing prevents it from taking its worst value (particularly, the two desiderata cannot guarantee η = η*, or g = g*, or g ∘ η = h* = h̃* on supp(p_x, p̃_x)). This is essentially an identifiability problem. The bound of Johansson et al. (2019) also explicitly shows the role of support overlap, thus can be called a support-invertibility bound. They also give an example to show that DIR (particularly implemented by minimizing MMD) is not necessary ("sometimes too strict") for learning the shared/invariant p(y|x).

(3) A third bound. Zhao et al. (2019) consider the distance d_JS(p, q) := √(JS(p, q)) (Endres & Schindelin, 2003), where JS(p, q) is the JS divergence. It is bounded: 0 ≤ d_JS(p, q) ≤ 1. It is shown that (Zhao et al., 2019, 4.8):

d_JS(p_y, p̃_y) ≤ d_JS(η_#[p_x], η_#[p̃_x]) + √(R(g ∘ η)) + √(R̃(g ∘ η)).
If d_JS(p_y, p̃_y) ≥ d_JS(η_#[p_x], η_#[p̃_x]), it is shown that (Zhao et al., 2019, Theorem 4.3):

R(g ∘ η) + R̃(g ∘ η) ≥ ½ (d_JS(p_y, p̃_y) − d_JS(η_#[p_x], η_#[p̃_x]))²,   (47)

or, when the two domains are allowed to have their own representation-level labeling functions g and g̃ (Zhao et al., 2019, Corollary 4.1),

R(g ∘ η) + R̃(g̃ ∘ η) ≥ ½ (d_JS(p_y, p̃_y) − d_JS(η_#[p_x], η_#[p̃_x]))².

So when p(y) ≠ p̃(y), we have d_JS(p_y, p̃_y) > 0, so DIR that minimizes d_JS(η_#[p_x], η_#[p̃_x]) becomes harmful to minimizing the target risk R̃(g ∘ η). Arjovsky et al. (2019) point out that in the covariate shift case, achieving a DIR p(s) = p̃(s) implies p(y) = p̃(y) (since p(s) = p̃(s) and p(y|s) = p̃(y|s)). This may not hold in practice. When it does not hold, the bound above shows that DIR can limit prediction accuracy.

Comparison with CSG. Existing bounds Eqs. (43, 44, 45, 46) relate the source and target risks of a general and common labeling function h ∈ H, i.e., R̃(h) − R(h), which serves to bound an objective; while our bound Eq. (36) relates the target risks of the optimal labeling functions on the source h* and target h̃* domains, i.e., R̃(h*) − R̃(h̃*), which measures the risk gap of the best source labeling function on the target domain. After adaptation, the prediction analysis (Eq. (42)) shows that CSG-DA achieves the optimal labeling function on the target domain in the infinite data limit. Regarding Eq. (47), we are not minimizing d_JS(η_#[p(x)], η_#[p̃(x)]), so our method is also sound under that view. In fact, in our model the representation distributions on the two domains are p(s) = ∫ p(s, v) dv and p̃(s) = ∫ p̃(s, v) dv (replacing η_#[p(x)] and η_#[p̃(x)]). We allow p(s, v) ≠ p̃(s, v) of course, and do not seek to match them. Essentially, we do not rely on the invariance of p(s|x) and p(y|x), or η* and h*, but on the invariance of p(x|s, v) in the other (generative) direction.
This thus allows p̃(s|x) ≠ p(s|x) and p̃(y|x) ≠ p(y|x), or η̃* ≠ η* and h̃* ≠ h*, so we rely on an assumption different from the idea behind all the bounds above. Since the data at hand is produced following a certain mechanism of nature anyway, the invariance in the generative direction p(x|s, v) is more plausible (see Section 3.2).

E METHODOLOGY DETAILS

E.1 DERIVATION OF LEARNING OBJECTIVES

The Evidence Lower BOund (ELBO). A common and effective approach to matching the data distribution p*(x, y) is maximum likelihood, that is, to maximize E_{p*(x,y)}[log p(x, y)]. It is equivalent to minimizing KL(p*(x, y) ‖ p(x, y)) (note that E_{p*}[log p*(x, y)] is a constant), so it drives p(x, y) towards p*(x, y). But the likelihood function p(x, y) = ∫ p(s, v, x, y) ds dv involves an intractable integration, which is hard to estimate and optimize. To address this, the popular method of variational expectation-maximization (variational EM) introduces a tractable distribution q(s, v|x, y) (having a closed-form density function and being easy to sample) over the latent variables given the observed variables, and a lower bound of the likelihood function can be derived:

log p(x, y) = log ∫ p(s, v, x, y) ds dv = log E_{q(s,v|x,y)}[p(s, v, x, y)/q(s, v|x, y)] ≥ E_{q(s,v|x,y)}[log (p(s, v, x, y)/q(s, v|x, y))] =: L_{q,p}(x, y),

where the inequality follows Jensen's inequality and the concavity of the log function. The function L_{q,p}(x, y) is thus called the Evidence Lower BOund (ELBO). The tractable distribution q(s, v|x, y) is called the variational distribution, and is commonly instantiated by a standalone model (separate from the generative model) called an inference model. Moreover, we have:

L_{q,p}(x, y) + KL(q(s, v|x, y) ‖ p(s, v|x, y)) = E_{q(s,v|x,y)}[log (p(s, v, x, y)/q(s, v|x, y))] + E_{q(s,v|x,y)}[log (q(s, v|x, y)/p(s, v|x, y))] = E_{q(s,v|x,y)}[log (p(s, v, x, y)/p(s, v|x, y))] = E_{q(s,v|x,y)}[log p(x, y)] = log p(x, y),

so maximizing L_{q,p}(x, y) w.r.t. q is equivalent to minimizing KL(q ‖ p(s, v|x, y)) (note that the r.h.s. log p(x, y) is constant w.r.t. q), which drives q towards the true posterior (i.e., variational inference); and once this is (perfectly) done, L_{q,p}(x, y) becomes a lower bound of log p(x, y) that is tight at the current model p, so maximizing L_{q,p}(x, y) w.r.t. p effectively maximizes log p(x, y), i.e., serves as maximizing likelihood.
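The two properties just derived (the ELBO lower-bounds the log evidence, with equality exactly at the true posterior) can be checked on a toy discrete latent-variable model. In the sketch below (numpy), a random joint table stands in for p(s, v, x, y), with a single discrete latent z playing the role of (s, v) and a single discrete observation d playing the role of (x, y):

```python
import numpy as np

rng = np.random.default_rng(0)
p_joint = rng.random((4, 5)); p_joint /= p_joint.sum()  # toy joint p(z, d)
d = 2                                                    # an observed value

log_evidence = np.log(p_joint[:, d].sum())               # log p(d)
posterior = p_joint[:, d] / p_joint[:, d].sum()          # true posterior p(z|d)

def elbo(q):  # E_q[log p(z, d) - log q(z)]
    return np.sum(q * (np.log(p_joint[:, d]) - np.log(q)))

q_bad = np.array([0.4, 0.3, 0.2, 0.1])                   # arbitrary variational dist.
assert elbo(q_bad) <= log_evidence + 1e-12               # Jensen: ELBO <= log evidence
assert np.isclose(elbo(posterior), log_evidence)         # tight at the true posterior
```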
So the training objective becomes the expected ELBO, E_{p*(x,y)}[L_{q,p}(x, y)]. Optimizing it w.r.t. q and p alternately drives p(x, y) towards p*(x, y) and q(s, v|x, y) towards p(s, v|x, y) eventually. The derivations and conclusions above hold for general latent variable models, with (s, v) representing the latent variables and (x, y) the observed (data) variables.

Variational EM for CSG. In the supervised case, the expected ELBO objective E_{p*(x,y)}[L_{q,p}(x, y)] can also be understood as the conventional supervised learning loss, i.e., the cross entropy, regularized by a generative reconstruction term. As explained in the main text (Section 4), after training, we only have the model p(s, v, x, y) and an approximation q(s, v|x, y) to the posterior p(s, v|x, y), and prediction using p(y|x) is still intractable. So we employ a tractable distribution q(s, v, y|x) to model the required variational distribution as q(s, v|x, y) = q(s, v, y|x)/q(y|x), where q(y|x) = ∫ q(s, v, y|x) ds dv is the derived marginal distribution of y (we will show that it can be effectively estimated and sampled from). With this instantiation, the expected ELBO becomes:

E_{p*(x,y)}[L_{q,p}(x, y)]
 = ∫ p*(x, y) (q(s, v, y|x)/q(y|x)) log (p(s, v, x, y) q(y|x) / q(s, v, y|x)) ds dv dx dy
 = ∫ p*(x, y) (q(s, v, y|x)/q(y|x)) log q(y|x) ds dv dx dy + ∫ p*(x, y) (q(s, v, y|x)/q(y|x)) log (p(s, v, x, y)/q(s, v, y|x)) ds dv dx dy
 = ∫ p*(x) ∫ p*(y|x) (∫ q(s, v, y|x) ds dv / q(y|x)) log q(y|x) dy dx + ∫ p*(x) ∫ (p*(y|x)/q(y|x)) ∫ q(s, v, y|x) log (p(s, v, x, y)/q(s, v, y|x)) ds dv dy dx
 = E_{p*(x)} E_{p*(y|x)}[log q(y|x)] + E_{p*(x)} E_{q(s,v,y|x)}[(p*(y|x)/q(y|x)) log (p(s, v, x, y)/q(s, v, y|x))],

which is Eq. (1). The first term is the (negative) expected cross entropy loss, which drives the inference model (predictor) q(y|x) towards p*(y|x) for p*(x)-a.e. x.
Once this is (perfectly) done, the second term becomes E_{p*(x)} E_{q(s,v,y|x)}[log (p(s, v, x, y)/q(s, v, y|x))], which is the expected ELBO E_{p*(x)}[L_{q(s,v,y|x),p}(x, y)] for q(s, v, y|x). It thus drives q(s, v, y|x) towards p(s, v, y|x) and p(x) towards p*(x). It accounts for a regularization that fits the input distribution p*(x) and aligns the inference model (predictor) with the generative model. The target of q(s, v, y|x), i.e., p(s, v, y|x), admits the factorization p(s, v, y|x) = p(s, v|x) p(y|s) due to the graphical structure (Fig. 1) of CSG (i.e., y ⊥⊥ (x, v)|s). The factor p(y|s) is known (the invariant causal mechanism generating y in CSG), so we only need to employ an inference model q(s, v|x) for the intractable factor p(s, v|x), giving q(s, v, y|x) = q(s, v|x) p(y|s). Using this relation, we can reformulate Eq. (1) as:

E_{p*(x,y)}[L_{q,p}(x, y)]
 = E_{p*(x,y)}[log q(y|x)] + E_{p*(x)} ∫ q(s, v|x) p(y|s) (p*(y|x)/q(y|x)) log (p(s, v, x)/q(s, v|x)) ds dv dy
 = E_{p*(x,y)}[log q(y|x)] + E_{p*(x)} ∫ (p*(y|x)/q(y|x)) (∫ q(s, v|x) p(y|s) log (p(s, v, x)/q(s, v|x)) ds dv) dy
 = E_{p*(x,y)}[log q(y|x)] + E_{p*(x,y)}[(1/q(y|x)) E_{q(s,v|x)}[p(y|s) log (p(s, v, x)/q(s, v|x))]],

which is Eq. (2). With this form of q(s, v, y|x) = q(s, v|x) p(y|s), we have q(y|x) = E_{q(s,v|x)}[p(y|s)], which can also be estimated and optimized using reparameterization. For prediction, we can sample from the approximation q(y|x) instead of the intractable p(y|x). This can be done by ancestral sampling: first sample (s, v) from q(s, v|x), and then use the sampled s to sample y from p(y|s). The conclusions and methods can also be applied to general latent generative models for supervised learning, with (s, v) representing the latent variables.
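The prediction procedure just described can be sketched as follows (numpy; the encoder, its output dimensions, and the logistic p(y|s) are all hypothetical stand-ins, not the paper's architecture): estimate q(y|x) = E_{q(s,v|x)}[p(y|s)] by sampling (s, v) from q(s, v|x) and averaging p(y|s) over the semantic part of the samples.

```python
import numpy as np

rng = np.random.default_rng(0)

def encoder(x):
    # stands in for eta(x); returns the mean and (diagonal) std of q(s, v | x)
    return np.tanh(x), 0.1 * np.ones_like(x)

w = np.array([1.5, -0.7])  # parameters of a hypothetical p(y=1|s), assuming d_s = 2

def predict(x, n_samples=1000):
    mean, std = encoder(x)
    # ancestral sampling: (s, v) ~ q(s, v | x); the first two dims play the role of s
    z = mean + std * rng.standard_normal((n_samples, mean.size))
    s = z[:, :2]
    p_y1 = 1.0 / (1.0 + np.exp(-s @ w))  # p(y=1|s) for each sample
    return p_y1.mean()                   # Monte Carlo estimate of q(y=1|x)

p = predict(np.array([0.3, -1.2, 0.5]))
assert 0.0 < p < 1.0
```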
When a model does not distinguish the two (groups of) latent factors s and v and treats them as one latent variable z = (s, v), a similar deduction gives:

E_{p*(x,y)}[L_{q,p}(x, y)] = E_{p*(x,y)}[log q(y|x)] + E_{p*(x,y)}[(1/q(y|x)) E_{q(z|x)}[p(y|z) log (p(z, x)/q(z|x))]],   (48)

where q(y|x) = E_{q(z|x)}[p(y|z)]. This is the conventional supervised variational auto-encoder (sVAE) baseline in our experiments.

Variational EM to learn CSG with independent prior (CSG-ind). See the main text in Section 4.1 for motivation and basic methods. Since the prior is the only difference between p(s, v, x, y) and p⊥⊥(s, v, x, y), we have p(s, v, x, y)/p⊥⊥(s, v, x, y) = p(s, v)/p⊥⊥(s, v) = p(s, v)/(p(s) p(v)) = p(v|s)/p(v). So p(s, v, y|x) = (p(v|s)/p(v)) (p⊥⊥(x)/p(x)) p⊥⊥(s, v, y|x). As explained, inference models now only need to approximate the posterior of (s, v) given x. Since p(s, v, y|x) = p(s, v|x) p(y|s) and p⊥⊥(s, v, y|x) = p⊥⊥(s, v|x) p(y|s) share the same p(y|s) factor, we have p(s, v|x) = (p(v|s)/p(v)) (p⊥⊥(x)/p(x)) p⊥⊥(s, v|x). The variational distributions q(s, v|x) and q⊥⊥(s, v|x) target p(s, v|x) and p⊥⊥(s, v|x) respectively, so we can express the former with the latter: q(s, v|x) = (p(v|s)/p(v)) (p⊥⊥(x)/p(x)) q⊥⊥(s, v|x). Once q⊥⊥(s, v|x) achieves its goal, the q(s, v|x) so represented also achieves its goal. So we only need to construct an inference model for q⊥⊥(s, v|x) and optimize it. With this representation, we have:

q(y|x) = E_{q(s,v|x)}[p(y|s)] = E_{q⊥⊥(s,v|x)}[(p(v|s)/p(v)) (p⊥⊥(x)/p(x)) p(y|s)] = (p⊥⊥(x)/p(x)) E_{q⊥⊥(s,v|x)}[(p(v|s)/p(v)) p(y|s)] = (p⊥⊥(x)/p(x)) π(y|x),

where π(y|x) := E_{q⊥⊥(s,v|x)}[(p(v|s)/p(v)) p(y|s)] as in the main text, which can be estimated and optimized using the reparameterization of q⊥⊥(s, v|x). From Eq.
(2), the expected ELBO training objective can be reformulated as:

E_{p*(x,y)}[L_{q,p}(x, y)]
 = E_{p*(x,y)}[log q(y|x) + (1/q(y|x)) E_{q(s,v|x)}[p(y|s) log (p(s, v, x)/q(s, v|x))]]
 = E_{p*(x,y)}[log π(y|x) + (1/π(y|x)) E_{q⊥⊥(s,v|x)}[(p(v|s)/p(v)) p(y|s) log (p⊥⊥(s, v, x)/q⊥⊥(s, v|x))]],

where in the last equality we have substituted q(y|x) = (p⊥⊥(x)/p(x)) π(y|x) and q(s, v|x) = (p(v|s)/p(v)) (p⊥⊥(x)/p(x)) q⊥⊥(s, v|x), used the definition of π(y|x), and noted that the resulting log(p(x)/p⊥⊥(x)) terms cancel. This gives Eq. (3). Note that π(y|x) is not used in prediction, so there is no need to sample from it. Prediction is done by ancestral sampling from q⊥⊥(y|x): first sample from q⊥⊥(s, v|x) and then from p(y|s). Using this reformulation, we can train a CSG with an independent prior even on data that manifests a correlated prior. The objective Eq. (5) on the training domain for domain adaptation can be derived similarly. For numerical stability, we employ the log-sum-exp trick to estimate the expectations and compute the gradients.

Figure 3: The black solid arrow specifies p(y|s) in the generative model, and the blue dashed arrows (representing computational directions but not causal directions) specify q(s, v|x) (or q⊥⊥(s, v|x) or q̃(s, v|x)) as the inference model.
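The log-sum-exp trick mentioned above, in a minimal sketch (numpy; the log-weights are illustrative values chosen to underflow): a Monte Carlo estimate of log E[w] from log-weights log w_i, computed stably by shifting by the maximum before exponentiating.

```python
import numpy as np

# log-weights log w_i that would underflow if exponentiated directly
log_w = np.array([-1050.0, -1052.0, -1049.0])

with np.errstate(divide='ignore'):
    naive = np.log(np.mean(np.exp(log_w)))  # exp underflows to 0, giving -inf

m = log_w.max()
stable = m + np.log(np.mean(np.exp(log_w - m)))  # shift, exponentiate, shift back

assert np.isneginf(naive) and np.isfinite(stable)
assert np.isclose(stable, -1049.75, atol=1e-2)
```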

E.2 INSTANTIATING THE INFERENCE MODEL

Although motivated from learning a generative model, the method can be implemented using a general discriminative model (with hidden nodes) with causal behavior. By parsing some of the hidden nodes as s and some others as v, a discriminative model can formalize a distribution q(s, v, y|x), which implements the inference model and the generative mechanism p(y|s). The parsing mode is shown in Fig. 3, which is based on the following consideration. (1) The graphical structure of CSG in Fig. 1 indicates that (v, x) ⊥⊥ y|s, so the hidden nodes for s should isolate y from v and x. The model then factorizes the distribution as q(s, v, y|x) = q(s, v|x) q(y|s), and since the inference and generative models share the distribution of y|s (see the main text for explanation), we can use the component q(y|s) given by the discriminative model to implement the generative mechanism p(y|s). (2) The graphical structure in Fig. 1 also indicates that s and v are dependent given x, due to the v-structure (collider) at x ("explaining away"). The component q(s, v|x) should embody this dependence, so the hidden nodes chosen as v should have an effect on those chosen as s. Note that the arrows in Fig. 3 represent computation directions, not causal directions. We orient the computation direction v → s since all hidden nodes in a discriminative model eventually contribute to computing y. After parsing, the discriminative model gives a mapping (s, v) = η(x). We implement the distribution by q(s, v|x) = N(s, v | η(x), Σ_q). For all three cases of CSG, CSG-ind and CSG-DA, only one inference model for (s, v)|x is required. The (s, v)|x component of the discriminative model thus parameterizes q⊥⊥(s, v|x) and q̃(s, v|x) for CSG-ind and CSG-DA, respectively. The expectations in all objectives (except those over p*, which are estimated by averaging over data) are all under the respective (s, v)|x.
They can be estimated using η(x) by the reparameterization trick (Kingma & Welling, 2014), and the gradients can be back-propagated. We need two more components beyond the discriminative model to implement the method, i.e., the prior p(s, v) and the generative mechanism p(x|s, v). The latter can be implemented using a generator or decoder architecture comparable to the component q(s, v|x). The prior can be commonly implemented using a multivariate Gaussian distribution,

p(s, v) = N( (s; v) | (μ_s; μ_v), Σ ),  Σ = [Σ_ss, Σ_sv; Σ_vs, Σ_vv].

We parameterize Σ via its Cholesky decomposition, Σ = LL⊤, where L is a lower-triangular matrix with positive diagonals, which is in turn parameterized as L = [L_ss, 0; M_vs, L_vv] with smaller lower-triangular matrices L_ss and L_vv and an arbitrary matrix M_vs. The matrices L_ss and L_vv are parameterized by a summation of positive diagonals (guaranteed via an exponential map) and a strictly lower-triangular (excluding diagonals) matrix. The conditional distribution p(v|s) required for training CSG-ind is given by p(v|s) = N(v | μ_{v|s}, Σ_{v|s}), where μ_{v|s} = μ_v + M_vs L_ss⁻¹ (s − μ_s) and Σ_{v|s} = L_vv L_vv⊤ (see, e.g., Bishop (2006)). This prior does not imply a causal direction between s and v (the linear-Gaussian case of Zhang & Hyvärinen (2009)), thus it well serves as a prior for CSG.
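The closed-form conditional above follows from the block Cholesky structure. A quick numerical sanity check (numpy; block sizes and values are illustrative) against the generic Gaussian conditioning formulas:

```python
import numpy as np

rng = np.random.default_rng(0)
ds, dv = 2, 3

# lower-triangular blocks with positive diagonals (via an exponential map)
L_ss = np.tril(rng.standard_normal((ds, ds))); np.fill_diagonal(L_ss, np.exp(np.diag(L_ss)))
L_vv = np.tril(rng.standard_normal((dv, dv))); np.fill_diagonal(L_vv, np.exp(np.diag(L_vv)))
M_vs = rng.standard_normal((dv, ds))

L = np.block([[L_ss, np.zeros((ds, dv))], [M_vs, L_vv]])
Sigma = L @ L.T
mu = rng.standard_normal(ds + dv); mu_s, mu_v = mu[:ds], mu[ds:]

# conditional p(v|s) from the closed form in the text
s = rng.standard_normal(ds)
mu_vs = mu_v + M_vs @ np.linalg.solve(L_ss, s - mu_s)
Sigma_vs = L_vv @ L_vv.T

# cross-check against the generic Gaussian conditioning formulas
Sss, Ssv = Sigma[:ds, :ds], Sigma[:ds, ds:]
Svs, Svv = Sigma[ds:, :ds], Sigma[ds:, ds:]
mu_ref = mu_v + Svs @ np.linalg.solve(Sss, s - mu_s)
Sigma_ref = Svv - Svs @ np.linalg.solve(Sss, Ssv)
assert np.allclose(mu_vs, mu_ref) and np.allclose(Sigma_vs, Sigma_ref)
```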

F EXPERIMENT DETAILS

We use a validation set from the training domain for hyperparameter selection, to avoid overfitting to the finite training-domain data samples. The training and validation sets are constructed under an 80%-20% random split in each task. We note that hyperparameter selection in OOD tasks is itself controversial and nontrivial, and it is still an active research direction (You et al., 2019). It is argued that if a validation set from the test domain is available, a better choice would be to incorporate it into learning as the semi-supervised adaptation task, instead of using it only for validation. As our methods are designed to fit the training-domain data, and our theory shows guarantees under a good fit of the training-domain data distribution, hyperparameter selection using a training-domain validation set is reasonable. We align the scale of the CE term in the objectives of all methods, and tune the coefficients of the ELBOs to be their largest values that keep the final accuracy near 1 on the validation set, so that they wield the most power on the test domain while remaining faithful to explicit supervision. The coefficients are preferred to be large, so as to well fit p*(x) (and p̃*(x) for domain adaptation) and gain generalizability in the test domain; meanwhile, they should not affect training accuracy, which is required for a good fit of the training distribution. The supervised variational auto-encoder (sVAE) baseline method is a counterpart of CSG that does not separate its latent variable z into s and v. This means that all its latent variables in z directly (i.e., not mediated by s) affect the output y. It is learned by optimizing Eq. (48) for OOD generalization, and adopts a method similar to CSG-DA for domain adaptation. To align the model architectures for a fair comparison, the latent variable z of sVAE is taken as the latent variable s in CSG. All the experiments are implemented in PyTorch.



Supplement C provides more explanations on the model.

A transformation is a homeomorphism if it is a continuous bijection with a continuous inverse. The 2-norm ‖·‖₂ for matrices refers to the induced operator norm (not the Frobenius norm). Eq. (36) requires supp(p_x) = supp(p̃_x); ∆ denotes the Laplacian operator.

Unfortunately, the opposite direction seems to hold when there exist η* and g* (not necessarily the ones in the existence assumption or in Assumption 3 of Johansson et al. (2019)) such that p_y = (g* ∘ η*)_#[p_x] and p̃_y = (g* ∘ η*)_#[p̃_x], and η is a reparameterization of η*, due to the celebrated data processing inequality.

Other approaches to introducing randomness are also possible, such as employing stochasticity on the parameters/weights as in Bayesian neural networks (Neal, 1995), or using dropout (Srivastava et al., 2014; Gal & Ghahramani, 2016). Here we adopt this simple treatment to highlight the main contribution.



Figure 1: The graphical structure of the proposed Causal Semantic Generative model (CSG) for the semantic factor s, the variation factor v, and supervised data (x, y). Black solid arrows represent the invariant causal generating mechanisms p(x|s, v) and p(y|s), the black undirected edge represents a domain-specific prior p(s, v), and blue dashed bent arrows represent the inference model q(s, v|x) for learning and prediction.

the condition that it is a.e. non-zero indicates that F[f] = F[f′] a.e., thus f = f′ a.e. See also Khemakhem et al. (2019, Theorem 1).

A.1 PROOF OF THE EQUIVALENCE RELATION

Proposition A.6. The semantic equivalence defined in Definition 5.3 is an equivalence relation if V is connected and is either open or closed in R^{d_V}.

Theorem 5.5' (OOD generalization error). Let Assumptions 5.1 and 5.2 hold. (i) Consider two CSGs p and p̃ that share the same generative mechanisms p(x|s, v) and p(y|s) but have different priors p_{s,v} and p̃_{s,v}. Then, up to O(σ²_µ) where σ²_µ := E[µᵀµ], we have for any x ∈ supp(p_x) ∩ supp(p̃_x),

Assumption 3 of Johansson et al. (2019) further assumes covariate shift and that g* ∘ η* = h* on supp(p_x, p̃_x); that is, there exist η* ∈ E and g* ∈ G such that η*_#[p_x] = η*_#[p̃_x] and g* ∘ η* = h* = h̃* on supp(p_x, p̃_x). They also mention that this is not guaranteed to hold in practice. Problems of DA-DIR: Johansson et al. (2019) and Zhao et al. (2019) give examples where, even under as strong an assumption as Assumption 3 of Johansson et al. (2019) (i.e., covariate shift and a strong existence assumption), the two desiderata of DA-DIR (i.e., minimal source risk R(g ∘ η) = R(h*) and DIR η_#[p_x] = η_#[p̃_x]) still allow the bounds to be uselessly loose and the target risk R̃(g ∘ η)

In both examples by Johansson et al. (2019) and Zhao et al. (2019), supp(p_x) ∩ supp(p̃_x) = ∅. This may cause the problem that g ∘ η can be very different from h̃* on supp(p̃_x) even when R(g ∘ η) = R(h*). The bound developed by Johansson et al. (

Figure 3: Parsing a general discriminative model as an inference model for CSG. The black solid arrow specifies p(y|s) in the generative model, and the blue dashed arrows (representing computational directions but not causal directions) specify q(s, v|x) (or q⊥⊥(s, v|x) or q̃(s, v|x)) as the inference model.

Table: Accuracy (%) of various methods (ours in bold) on OOD generalization (left) and domain adaptation (right) for shifted MNIST. Averaged over 10 runs.

Table: Results on ImageCLEF-DA (ima, 2014). Results of CE, DANN, DAN and CDAN are taken from Long et al. (2018).

This shows the benefit of separating semantics from variation and modeling the variation explicitly, so that the model can consciously drive semantic representation into s. For domain adaptation, existing methods differ a lot in performance, and hardly perform well on both test domains. When identification fails, adaptation sometimes even worsens the result, as the misleading representation based on position gets strengthened on the unsupervised test data. CSG benefits from adaptation by leveraging the test data in a proper way that identifies the semantics.

Zhao et al. (2019) develop another bound for binary classification, where Y := {0, 1} and R(h) := E_{p(x)}[|h*(x) − h(x)|]. Denote d_JS(p, q) := √(JS(p, q)) as the JS distance (Endres
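For concreteness, the JS distance on discrete distributions can be computed as follows (a minimal sketch assuming natural logarithms; the helper names are ours, not from any cited codebase):

```python
import numpy as np

def kl(p, q):
    """KL divergence between discrete distributions (natural log),
    with the 0 log 0 = 0 convention."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

def js_distance(p, q):
    """JS distance: the square root of the Jensen-Shannon divergence,
    JS(p, q) = (KL(p || m) + KL(q || m)) / 2 with m = (p + q) / 2."""
    m = (np.asarray(p, float) + np.asarray(q, float)) / 2.0
    return float(np.sqrt(0.5 * kl(p, m) + 0.5 * kl(q, m)))

# Identical distributions are at distance 0; disjoint ones attain the
# maximum sqrt(log 2) under natural logarithms.
assert js_distance([0.5, 0.5], [0.5, 0.5]) == 0.0
assert abs(js_distance([1.0, 0.0], [0.0, 1.0]) ** 2 - np.log(2)) < 1e-12
```

Unlike the JS divergence itself, its square root satisfies the triangle inequality, which is why the bound is stated in terms of the distance.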


F.1 SHIFTED MNIST

We use a multilayer perceptron (MLP) with sigmoid activations for the inference model of generative methods, with 784 (for x)-400-200 (first 100 for v)-50 (for s or z)-1 (for y) nodes in each layer, and an MLP with 50 (for s)-(100 (for v)+100)-400-784 (for x) nodes in each layer for their generative component p(x|s, v). We use a larger architecture with 784-600-300-75-1 nodes in each layer for discriminative methods, to compensate for the additional parameters of generative methods. For all methods, we use a mini-batch of size 128 in each optimization step, and use the RMSprop optimizer (Tieleman & Hinton, 2012) with weight-decay parameter 1 × 10^-5, and learning rate 1 × 10^-3 for OOD generalization and 3 × 10^-4 for domain adaptation. These hyperparameters are chosen and then fixed, by running and then validating using CE and DANN. For generative methods, we take the Gaussian variances of p(x|s, v) and q(s, v|x) as 0.03². The standard deviations of these conditional Gaussian distributions are chosen small to meet the intense-causal-mechanism assumption in our theory (e.g., in Theorem 5.4). We train the models for 100 epochs, by when all the methods converge in terms of loss and training accuracy.

We align the scale of the CE term in the objectives of all methods, and scale the ELBO terms with the largest weight that makes training accuracy near 1 in OOD generalization. We then fix the tuned weight and scale the weight of the adaptation terms in a similar way for domain adaptation. Other parameters are tuned similarly. For generative methods, the ELBO weight is 1 × 10^-5, selected from {1, 3} × 10^{-6,-5} ∪ 1 × 10^{-2,-1,0,1,2}, and the adaptation weights for sVAE-DA and CSG-DA are 1 × 10^-2 selected from 1 × 10^{0,-1,-2,-3,-4}, and 1 × 10^-5 selected from 1 × 10^{0,-1,-2,-3,-4} ∪ {1, 3} × 10^{-5,-6,-7,-8}, respectively. For domain-adaptation methods, the adaptation weight is 1 × 10^-4, except for CDAN which adopts 1 × 10^-2, all selected from 1 × 10^{0,-1,-2,-3,-4}.
For CNBB, we use regularization coefficients 1 × 10^-4 and 3 × 10^-6 to regularize the sample weight and the learned representation, respectively, and run 4 inner gradient-descent iterations with learning rate 1 × 10^-3 to optimize the sample weight. These parameters are selected from a grid search, where the respective ranges are: {1, 3} × 10^{-2,-3,-4}, {1, 3} × 10^{-4,-5,-6}, {4, 8}, and 1 × 10^{-1,-2,-3}.
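For illustration, the inference-MLP layer widths above, and where s and v are read off, can be sketched as a plain forward pass (a minimal numpy sketch with arbitrary random weights; the actual PyTorch model additionally carries the variational heads and reparameterized sampling):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
# Layer widths of the inference MLP: 784 (x) -> 400 -> 200 -> 50 -> 1 (y).
widths = [784, 400, 200, 50, 1]
weights = [rng.normal(scale=0.01, size=(m, n))
           for m, n in zip(widths[:-1], widths[1:])]

def forward(x):
    """Forward pass with sigmoid activations; returns all layer outputs."""
    outputs, h = [], x
    for W in weights:
        h = sigmoid(h @ W)
        outputs.append(h)
    return outputs

x = rng.normal(size=(128, 784))   # a mini-batch of size 128
h1, h2, h3, y_prob = forward(x)
v = h2[:, :100]                   # first 100 units of the 200-wide layer: v
s = h3                            # the 50-wide layer: s (or z for sVAE)
assert v.shape == (128, 100) and s.shape == (128, 50)
assert y_prob.shape == (128, 1)
```

The discriminative baselines use the same pattern with the wider 784-600-300-75-1 layout and no latent read-off.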

F.2 IMAGECLEF-DA

We adopt the same setup as in Long et al. (2018). We use the ResNet50 architecture pretrained on ImageNet as the backbone of the discriminative/inference model. Input images are cropped and resized to shape (3, 224, 224). For CSG, we take the first 128 dimensions of the bottleneck layer (the resized last fully-connected layer of ResNet50, with output dimension 1024) as the variable v, and the output of a subsequent fully-connected layer with output dimension 1024 as the variable s. The output is produced by a linear layer built on s.

For generative methods (i.e., our methods and sVAE(-DA)), we construct an image decoder/generator that uses the DCGAN model (Radford et al., 2015) pretrained on CIFAR-10 as the backbone. The pretrained DCGAN is adapted from the PyTorch-GAN-Zoo. 11 The generator connects to the DCGAN backbone by an MLP layer to match DCGAN's input dimension 120, and generates images of the desired size (3, 224, 224) by appending to DCGAN's output of size (3, 64, 64) a transposed convolution layer with kernel size 4, stride 4 and padding 16.

Following Long et al. (2018), we use a mini-batch of size 32 in each optimization step, and adopt the SGD optimizer with Nesterov momentum parameter 0.9, weight-decay parameter 5 × 10^-4, and a shrinking step-size scheme with initial scale 1 × 10^-3, shrinking exponent 0.75 and per-datum coefficient 6.25 × 10^-6. For CSG methods, the Gaussian variances of p(x|s, v) and q(s, v|x) are taken as 0.1 and 3.0, respectively. The ELBO weight is 1 × 10^-8 for CSG methods and 1 × 10^-7 for sVAE, both selected from 1 × 10^{-2,-4,-6} ∪ {1, 3} × 10^{-7,-8,-9,-10}. The adaptation weights for sVAE-DA and CSG-DA are 1 × 10^-8 for task C→P and 1 × 10^-7 for task P→C, selected from the same range. For CNBB, we use regularization coefficients 1 × 10^-6 and 3 × 10^-6 to regularize the sample weight and the learned representation, respectively, and run 4 inner gradient-descent iterations with learning rate 1 × 10^-4 to optimize the sample weight.
These parameters are selected from a grid search, where the respective ranges are: 1 × 10^{-4,-5,-6,-7} ∪ {3 × 10^-6}, {1, 3} × 10^{-5,-6,-7}, {4}, and 1 × 10^{-2,-3,-4,-5}.

11 https://github.com/facebookresearch/pytorch_GAN_zoo
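Two quick arithmetic checks on the setup above (a sketch: `conv_transpose_out` is the standard transposed-convolution output-size formula, and `shrunk_lr` assumes the common inverse-decay form of the shrinking step-size scheme, applied per datum seen; both function names are ours):

```python
def conv_transpose_out(size_in, kernel, stride, padding):
    """Output spatial size of a transposed convolution
    (no output_padding or dilation)."""
    return (size_in - 1) * stride - 2 * padding + kernel

# DCGAN output 64x64 -> desired 224x224 with kernel 4, stride 4, padding 16.
assert conv_transpose_out(64, kernel=4, stride=4, padding=16) == 224

def shrunk_lr(num_data_seen, lr0=1e-3, coeff=6.25e-6, power=0.75):
    """Inverse-decay schedule: lr0 * (1 + coeff * n)^(-power), with the
    initial scale 1e-3, per-datum coefficient 6.25e-6, exponent 0.75."""
    return lr0 * (1.0 + coeff * num_data_seen) ** (-power)

# The step size starts at the initial scale and shrinks monotonically.
assert shrunk_lr(0) == 1e-3
assert shrunk_lr(32 * 1000) < shrunk_lr(32 * 10) < shrunk_lr(0)
```

The transposed-convolution check confirms that a single extra layer with these hyperparameters suffices to upsample the (3, 64, 64) DCGAN output to the (3, 224, 224) input resolution of ResNet50.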

