IDENTIFYING LATENT CAUSAL CONTENT FOR MULTI-SOURCE DOMAIN ADAPTATION

Anonymous

Abstract

Multi-source domain adaptation (MSDA) learns to predict the labels of target domain data, under the setting that data from multiple source domains are labelled and data from the target domain are unlabelled. Most methods for this task focus on learning invariant representations across domains. However, their success relies heavily on the assumption that the label distribution remains consistent across domains, which may not hold in general real-world problems. In this paper, we propose a new and more flexible assumption, termed latent covariate shift, where a latent content variable z_c and a latent style variable z_s are introduced in the generative process, with the marginal distribution of z_c changing across domains and the conditional distribution of the label given z_c remaining invariant across domains. We show that although (completely) identifying the proposed latent causal model is challenging, the latent content variable can be identified up to scaling by using its dependence with the labels from source domains, together with the identifiability conditions of nonlinear ICA. This motivates us to propose a novel method for MSDA, which learns the invariant label distribution conditional on the latent content variable, instead of learning invariant representations. Empirical evaluation on simulated and real data demonstrates the effectiveness of the proposed method.

1. INTRODUCTION

Traditional machine learning requires the training and testing data to be independent and identically distributed (Vapnik, 1999). This strict assumption may not be fulfilled in many real-world applications. For example, in medical applications it is common to train a model on patients from a few hospitals and generalize it to a new hospital (Zech et al., 2018); in this case, it is often reasonable to consider that the distributions of data from the training hospitals differ from that of the new hospital (Koh et al., 2021). Domain adaptation is a promising research area for handling such problems. In this work, we focus on the multi-source domain adaptation (MSDA) setting, where source domain data are collected from multiple domains. Formally, let x denote the input, e.g., an image, y denote the label, and D denote the domain index. We observe labeled data pairs (x_S, y_S) from the joint distributions p(x, y|D = 1), ..., p(x, y|D = m), ..., p(x, y|D = M) in the source domains, and unlabeled inputs x_T from the marginal distribution p(x|D = T) in the target domain. In the training phase of MSDA, the sets of (x_S, y_S) and x_T are used to train a predictor so that it provides a satisfactory estimate of y_T in the target domain. The key for MSDA is to understand how the joint distribution p_D(x, y) changes across the different source and target domains. Most early methods assume that the change of the joint distribution results from Covariate Shift (Huang et al., 2006; Bickel et al., 2007; Sugiyama et al., 2007; Wen et al., 2014), i.e., factorizing p_D(x, y) = p_D(y|x)p_D(x) as depicted by Figure 1(a): p_D(x) changes across domains, while the conditional distribution p_D(y|x) is invariant. Such an assumption may not hold in some real applications, e.g., image classification. In particular, the invariance of p_D(y|x) implies that p_D(y) should change as p_D(x) changes.
However, we can easily change style information (e.g., hue, viewpoint) in images to change p_D(x) while keeping p_D(y) unchanged, which is common in classification but violates the assumption. In contrast to covariate shift, most current works consider Conditional Shift, as depicted by Figure 1(b): the conditional p_D(x|y) changes while p_D(y) is invariant across domains (Zhang et al., 2013; 2015; Schölkopf et al., 2012; Stojanov et al., 2021; Peng et al., 2019). This setting motivates a popular class of methods that learn invariant representations across domains to approximate the latent content variable z_c in Figure 1(b) (Ganin et al., 2016; Zhao et al., 2018; Saito et al., 2018; Mancini et al., 2018; Yang et al., 2020; Wang et al., 2020; Li et al., 2021; Stojanov et al., 2021). However, the label distribution p_D(y) may change across domains in many real application scenarios (Tachet des Combes et al., 2020; Lipton et al., 2018; Zhang et al., 2013), in which case learning invariant representations may degrade performance. In theory, there exists an upper bound on the performance of learning invariant representations when the label distribution changes across domains (Zhao et al., 2019). To address this, we propose a more flexible assumption, latent covariate shift (LCS), as depicted in Figure 1(c), where the marginal p_D(z_c) changes across domains while p_D(y|z_c) remains invariant. To understand and handle LCS more deeply, we propose a latent causal model to formulate the data and label generating process, introducing a latent style variable z_s to complement z_c, as depicted in Figure 2. To analyse the identifiability of the proposed causal model, we also introduce latent noise variables n_c and n_s, which represent unmeasured factors influencing z_c and z_s, respectively. As a result, we can leverage recent identifiability results in nonlinear ICA (Hyvarinen et al., 2019; Khemakhem et al., 2020) to analyse the identifiability of the proposed latent causal model.
We show that although completely identifying the proposed latent causal model is often not possible without further assumptions, due to transitivity in the latent space, partially identifying the latent content variable z_c up to scaling is tractable, by integrating the identifiability results of nonlinear ICA with the dependence between n_c and y. This motivates us to propose a novel method that learns the invariant conditional distribution p_D(y|z_c) under LCS, instead of learning invariant representations. Relying on the guaranteed identifiability of z_c, the proposed method provides a principled way to ensure that z_c can be identified on the target domain data and that the learned predictor p_D(y|z_c) generalizes to the target domain. Empirical evaluation on synthetic and real data demonstrates the effectiveness of the proposed method compared with state-of-the-art methods. Overall, our main contributions can be summarized as follows: (i) Different from the commonly used Conditional Shift shown in Figure 1(b), which assumes the label distribution to be the same across domains, we propose a new problem setting, latent covariate shift, shown in Figure 1(c). (ii) We propose a latent causal model for latent covariate shift. Leveraging existing identifiability results of nonlinear ICA, we provide an analysis of the identifiability of the proposed latent causal graph, which provides a guarantee for identifying the latent content variable z_c. (iii) Based on the identifiability result, we design a new method for domain adaptation and empirically evaluate it on simulated and real data.

2. RELATED WORK

Learning invariant representations. Due to the limitation of covariate shift on image data, most current works for domain adaptation consider conditional shift and learn invariant representations across domains (Ganin et al., 2016; Zhao et al., 2018; Saito et al., 2018; Mancini et al., 2018; Yang et al., 2020; Wang et al., 2020; Li et al., 2021; Wang et al., 2022b; Zhao et al., 2021). Such invariant representations can be obtained by applying suitable linear or nonlinear transformations to the input data. The key question in these methods is how to enforce the invariance of the learned representations. For example, invariance can be enforced by maximum classifier discrepancy (Saito et al., 2018), by a domain discriminator for adversarial training (Ganin et al., 2016; Zhao et al., 2018; 2021), by moment matching (Peng et al., 2019), by a relation alignment loss (Wang et al., 2020), or by pseudo labeling (Wang et al., 2022b). However, all these methods require the label distribution to be invariant across domains. As a result, when the label distribution varies across domains, they may perform well only in the overlapping areas of the label distributions of the different domains, while facing challenges in the non-overlapping areas. To overcome this, recent progress focuses on learning representations that are invariant conditional on the label across domains (Gong et al., 2016; Ghifary et al., 2016; Tachet des Combes et al., 2020). One challenge for these methods is that the labels in the target domain are unavailable. More importantly, these methods do not guarantee that the learned representations are consistent with the true information relevant for prediction in the target domain, so there is no principled way to guarantee that the learned predictor generalizes to the target domain.

Learning the invariant conditional distribution p_D(y|z_c). A few works explore the invariant conditional distribution p_D(y|z_c) for domain adaptation (Kull & Flach, 2014; Bouvier et al., 2019). Different from these two works, the proposed method provides the identifiability of z_c, so that the learned p_D(y|z_c) can generalize to the target domain in a principled way. Besides, in the context of out-of-distribution generalization, some recent works explore learning the invariant conditional distribution p_D(y|z_c) (Arjovsky et al., 2019; Sun et al., 2021; Liu et al., 2021; Lu et al., 2021). For example, Arjovsky et al. (2019) learn an optimal invariant predictor across domains from the viewpoint of an intimate link between invariance and causation, while the proposed method directly exploits conditional invariance given the proposed latent causal model. Sun et al. (2021) mainly focus on a single domain, while the proposed method considers multiple domains. The proposed method also differs from Liu et al. (2021): that work assumes the latent content variable is caused by the style variable, while the proposed method relies on a confounder to model the causal relation between the latent content variable and the style variable. Unlike Lu et al. (2021), where the label is treated as a variable causing the other latent variables, the proposed method assumes that the label has no child nodes.

Causality for domain generalization. It has been shown that there is a close connection between causality and generalization (Peters et al., 2016). Motivated by this, many current methods leverage causality to devise new approaches in various applications, e.g., domain generalization (Mahajan et al., 2021; Christiansen et al., 2021; Wang et al., 2022a), text classification (Veitch et al., 2021), and out-of-distribution generalization (Ahuja et al., 2021). Perhaps the closest to our problem setting is domain generalization, where one cannot 'see' the target input data x.
In general, because the target input data x are unavailable in domain generalization, obtaining identifiability results in that setting is generally not possible. In contrast, this work provides an identifiability result, offering a principled way to guarantee that the learned predictor can generalize to the target domain.

3. THE PROPOSED LATENT CAUSAL MODEL FOR LATENT COVARIATE SHIFT

We introduce a latent causal model to formulate the generative process of features and labels for handling LCS, as depicted by Figure 2. It introduces the observed domain variable D to denote the specific domain in which data are collected. n_c and n_s represent unmeasured factors corresponding to latent content noise and latent style noise, respectively; they influence the latent content variable z_c and the latent style variable z_s, respectively. Generally speaking, z_c and z_s should be dependent given the domain variable D. Here we consider that z_c causes z_s, to model the correlation between z_s and y. In the proposed latent causal model, p_D(z_c) changes across domains while p_D(y|z_c) is invariant across domains, which meets the basic assumption of the proposed latent covariate shift. In the following, we discuss two key causal assumptions, which also highlight the novelty of the proposed latent causal model.

z_c causes y. Previous works consider the causal relation between x and y as y → x (Gong et al., 2016; Stojanov et al., 2019; Li et al., 2018), while we employ z_c → y. We argue that these two cases are not contradictory, since the labels y in the two cases carry two different physical meanings. To see this, let ŷ replace y in the first case (i.e., ŷ → x), to distinguish it from y in the second case (i.e., z_c → y). For the first case, consider the following generative process of images: a label ŷ is first sampled, then content information is determined according to ŷ, and finally an image is generated, which is a reasonable assumption in many real application scenarios. In the proposed latent causal model, n_c plays the role of ŷ and causes the content variable z_c. The second case, z_c → y, formulates the process in which experts extract content information from given images and then provide reasonable labels according to their domain knowledge. This assumption has been employed by some recent works (Mahajan et al., 2021; Liu et al., 2021; Sun et al., 2021).
In particular, these two different labels, ŷ and y, have been simultaneously considered in Mahajan et al. (2021).

z_c causes z_s. Here we consider having the object essence z_c first, from which a latent style z_s springs to render x. The correlation between y and z_s can be seen as a spurious correlation that should not contribute to predicting y; we employ z_c as a confounder of both y and z_s to model this spurious correlation. The rationality of this assumption can be further verified by considering the converse: if we assumed that z_s causes z_c, then all high-level information in the input data x, both z_s and z_c, would be causally related to the label y, which cannot model the spurious correlation and is clearly unreasonable. Therefore, assuming z_c → z_s is more persuasive and consistent with previous works (Gong et al., 2016; Stojanov et al., 2019; Mahajan et al., 2021). One recent work, Sun et al. (2021), leverages an additional variable as a confounder that influences both the content variable z_c and the style variable z_s to model their relation. Interestingly, the identifiability result in Sun et al. (2021) does not depend on this confounder; as a result, the confounder can be absorbed into the domain index, which is equivalent to the case where z_c and z_s are independent given the domain index. By contrast, the proposed latent causal model assumes a more general setting where z_c and z_s are dependent given the domain index, as depicted by Figure 2. We further verify the advantages of the proposed latent causal model over Sun et al. (2021) in experiments.

4. IDENTIFIABILITY ANALYSIS OF THE PROPOSED LATENT CAUSAL MODEL

In this section, we provide an identifiability analysis for the proposed latent causal model. We first build a connection between the proposed latent causal model and nonlinear ICA by using the independence among the latent noise variables. We then show that it is still challenging to completely identify the proposed latent causal model, due to transitivity, even with the identifiability results from nonlinear ICA. We finally show how to partially identify the latent content variable z_c, by integrating the identifiability result of nonlinear ICA (Khemakhem et al., 2020) with the dependence between n_c and y.

4.1. RELATING THE PROPOSED LATENT CAUSAL MODEL WITH NONLINEAR ICA

The proposed latent causal model splits the latent noise variables n into two disjoint parts, n_c and n_s, as depicted by Figure 3. Since the n_i model noise information, the latent noise variables n_i are assumed to be independent of each other in a causal system (Pearl, 2000; Spirtes et al., 2001)¹. As a result, it is natural to connect the latent noise variables n_i with the latent independent variables in nonlinear ICA. Specifically, nonlinear ICA aims to separate independent latent variables conditional on D, e.g., n_i, from observed mixed data, e.g., x, generated by a nonlinear function. Recent progress in Khemakhem et al. (2020) shows that one can recover the n_i up to permutation and scaling under relatively mild conditions, e.g., the existence of an auxiliary observed variable, similar to D in Figure 3, which modulates the distributions of all independent latent variables n_i.

Figure 3: Relating with nonlinear ICA.

This identifiability result for n_i holds in both source and target domains, since we have feature data x from both source and target domains in the training phase of domain adaptation. However, the permutation indeterminacy implies that we cannot determine which recovered variables n_i correspond to n_c (or n_s) without further information. Therefore, using the identifiability result of nonlinear ICA alone is insufficient to identify n_c and n_s; we discuss below how to handle the permutation indeterminacy. Although completely identifying the proposed latent causal model is challenging, for the domain adaptation application we are only interested in the identifiability of z_c, rather than the latent style variable z_s, since the label y is caused only by z_c. Thanks to the observed y_S from the source domains, we have the following identifiability result: Proposition 4.2.
Assume that (i) all latent noise variables n_i can be identified up to permutation and scaling. Then the latent content variable z_c in the proposed latent causal model is identifiable up to scaling, by using the dependence between n_c^S and y_S from the source domains. Condition (i) can be obtained from the identifiability result of nonlinear ICA (Khemakhem et al., 2020). The proposition shows that the content variable z_c can be identified up to scaling by combining the identifiability result of nonlinear ICA with the dependence between y and n_c: the nonlinear ICA result ensures that the n_i are recovered up to permutation and scaling, while the dependence removes the permutation indeterminacy, identifying n_c and thus z_c. The scaling indeterminacy of z_c results from the scaling indeterminacy of n_i and the unknown mapping from n_c to z_c. This scaling indeterminacy is of no consequence and can be ignored in latent space, since it can be 'absorbed' by the nonlinear mapping from z_c to y. For example, consider the recovered variable ẑ_c and a scaled version scaling(ẑ_c). When we learn an invariant predictor g(·) from ẑ_c to y, the scaling indeterminacy can be 'absorbed' by a composite predictor, e.g., g(scaling(·)).
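As a concrete toy illustration of how the dependence with y_S removes the permutation indeterminacy, the sketch below scores each recovered component by a histogram estimate of its mutual information with the source labels and keeps the highest-scoring ones as the estimate of n_c. This is an assumption-laden toy, not the paper's training procedure (which maximizes a variational MI bound jointly); all names here are illustrative.

```python
import numpy as np

def mi_discrete(z, y, bins=10):
    """Rough histogram estimate of I(z; y) for a continuous component z
    and discrete labels y; used only as a dependence score."""
    edges = np.quantile(z, np.linspace(0, 1, bins + 1)[1:-1])
    zb = np.digitize(z, edges)                    # bin index in 0..bins-1
    joint = np.zeros((bins, y.max() + 1))
    for a, b in zip(zb, y):
        joint[a, b] += 1
    joint /= joint.sum()
    pz = joint.sum(axis=1, keepdims=True)
    py = joint.sum(axis=0, keepdims=True)
    nz = joint > 0
    return float((joint[nz] * np.log(joint[nz] / (pz @ py)[nz])).sum())

def select_content(n_hat, y, k):
    """Keep the k recovered components most dependent on the labels:
    these are the candidates for n_c; the rest are treated as n_s."""
    scores = [mi_discrete(n_hat[:, j], y) for j in range(n_hat.shape[1])]
    return np.sort(np.argsort(scores)[::-1][:k])

# Toy check: component 2 carries the label, components 0-1 are pure noise.
rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=2000)
n_hat = rng.normal(size=(2000, 3))
n_hat[:, 2] += 3.0 * y          # label-dependent component
print(select_content(n_hat, y, k=1))   # -> [2]
```

The same idea underlies the MI term of the objective in Section 5: dependence with y_S singles out the content components among the permutation-ambiguous recovered noises.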

5. LEARNING INVARIANT p D (y|z c ) FOR MSDA

The identifiability of z_c provides a principled way to guarantee that we can learn a conditional distribution p_D(y|z_c) that is invariant across domains and thus generalizes to the target domain. Furthermore, since the identifiable n_c is the only parent node of z_c, learning p_D(y|z_c) can be converted into learning p_D(y|n_c), which is also invariant across domains, as depicted by Figure 2. In this section, we propose a novel method showing how to learn the invariant conditional distribution p_D(y|n_c) for MSDA.

5.1. THE PROPOSED METHOD FOR LEARNING INVARIANT p D (y|n c )

As analyzed in Section 4.3, the identifiability of z_c builds on the identifiability result of nonlinear ICA, so we need to identify the n_i first. To meet the conditions for identifiable n_i stated in Khemakhem et al. (2020), we employ the following Gaussian prior on n_c and n_s:

p(n|D) = p(n_c|D) p(n_s|D) = N(μ_{n_c}(D), Σ_{n_c}(D)) N(μ_{n_s}(D), Σ_{n_s}(D)),   (1)

where μ and Σ denote the mean and variance, respectively; both depend on the domain variable D and can be implemented with multi-layer perceptrons. Since the n_i are independent noise variables, Σ is a diagonal matrix. Other exponential-family distributions, e.g., the Laplace distribution, also meet the conditions for identifiable n_i and are thus also feasible (Khemakhem et al., 2020); we use the Gaussian prior since it readily admits the reparameterization trick (Kingma & Welling, 2013). The Gaussian prior in equation 1 gives rise to the following variational posterior:

q(n|D, x) = q(n_c|D, x) q(n_s|D, x) = N(μ'_{n_c}(D, x), Σ'_{n_c}(D, x)) N(μ'_{n_s}(D, x), Σ'_{n_s}(D, x)),   (2)

where μ' and Σ' denote the mean and variance of the posterior, respectively; both depend on the domain variable D and the observed x, and can be implemented by multi-layer perceptrons. Combining this with the Gaussian prior in equation 1, we can derive the following evidence lower bound (ELBO):

max E_{q(n|D,x)}[log p(x|n)] − D_KL(q(n|D, x) || p(n|D)),   (3)

where D_KL denotes the Kullback-Leibler divergence. By maximizing the ELBO in equation 3, we can recover the n_i up to scaling and permutation. To remove the permutation indeterminacy mentioned in Section 4.3, we simultaneously maximize the dependence between y_S and n_i^S so as to identify n_c^S; here we measure this dependence with mutual information.
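Since both the prior in equation 1 and the posterior in equation 2 are diagonal Gaussians, the KL term of the ELBO has a closed form. The following is a minimal numpy sketch of that term (illustrative only; in the actual model the means and variances are produced by MLPs):

```python
import numpy as np

def kl_diag_gauss(mu_q, var_q, mu_p, var_p):
    """KL( N(mu_q, diag(var_q)) || N(mu_p, diag(var_p)) ), summed over the
    latent dimensions -- the D_KL term of the ELBO in equation 3."""
    return 0.5 * np.sum(
        np.log(var_p / var_q) + (var_q + (mu_q - mu_p) ** 2) / var_p - 1.0
    )

# Sanity checks: the KL is zero iff the two Gaussians coincide.
mu = np.array([0.5, -1.0])
var = np.array([1.0, 2.0])
print(kl_diag_gauss(mu, var, mu, var))            # -> 0.0
print(kl_diag_gauss(mu, var, mu + 1.0, var) > 0)  # -> True
```

Maximizing the ELBO amounts to maximizing the reconstruction term while keeping this KL small, which pulls the posterior towards the domain-conditional prior.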
As a result, we arrive at:

max λ (E_{q(n|D,x)}[log p(x|n)] − D_KL(q(n|D, x) || p(n|D))) + I(n_c^S; y_S),   (4)

where the first term is the ELBO, I(n_c^S; y_S) denotes the mutual information between n_c^S and y_S in the source domains, and λ is a regularization hyper-parameter that balances the ELBO and the mutual information (MI). The proposed method is termed iLCC-MSDA (identifiable Latent Causal Content for MSDA) and includes two components: the ELBO and the mutual information. The ELBO component ensures that the n_i can be recovered up to scaling and permutation; the MI component resolves the permutation, determining which recovered n_i correspond to the latent content variables n_c. In the implementation, we use the variational lower bound on mutual information proposed by Alemi et al. (2016) to approximate the mutual information in equation 4:

max λ (E_{q(n|D,x)}[log p(x|n)] − D_KL(q(n|D, x) || p(n|D))) + E_{q(n_c|D,x)}[log p(y_S|n_c^S)].   (5)

A graphical depiction of the proposed iLCC-MSDA is shown in Figure 5.

Constraining the independence among n_i. As discussed above, the performance of the proposed iLCC-MSDA relies on the identifiability of the n_i, which requires enough domains across which the n_i change in order to capture their statistical independence well. In real applications, however, we may not have sufficiently many domains. To mitigate this issue, motivated by disentangled representations (Higgins et al., 2017; Kim & Mnih, 2018; Chen et al., 2018), we propose to use a hyper-parameter β to strengthen the independence among the n_i.

Entropy regularization. In the loss in equation 5, we maximize the dependence between y_S and n_c^S in the source domains via mutual information. To encourage the same causal influence in the target domain, we can also maximize the mutual information between ŷ_T and n_c^T by minimizing the following conditional entropy:

L_ent = −E_{p(ŷ_T|n_c^T)}[log p(ŷ_T|n_c^T)],   (6)

where ŷ_T denotes the estimated label in the target domain.
This regularization has been empirically used to make label predictions more deterministic in previous works (Wang et al., 2020; Li et al., 2021), while we motivate it from the viewpoint of causality. Our final loss function is therefore:

max λ (E_{q(n|D,x)}[log p(x|n)] − β D_KL(q(n|D, x) || p(n|D))) + E_{q(n_c|D,x)}[log p(y_S|n_c^S)] − γ L_ent,

where β, λ, γ are hyper-parameters that trade off the independence of n_c and n_s, the classifier, and the entropy regularization terms.
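The conditional-entropy regularizer in equation 6 can be sketched as follows, assuming the classifier outputs a batch of class-probability rows (a hedged toy estimate over a batch, not the paper's implementation):

```python
import numpy as np

def entropy_regularizer(probs):
    """Batch estimate of L_ent = -E[ p(y_hat|n_c) log p(y_hat|n_c) ],
    where each row of `probs` is a predicted class distribution."""
    eps = 1e-12                      # numerical floor inside the log
    return float(-np.mean(np.sum(probs * np.log(probs + eps), axis=1)))

# A confident prediction incurs low entropy, a uniform one high entropy,
# so minimizing L_ent pushes target-domain predictions to be deterministic.
confident = np.array([[0.98, 0.01, 0.01]])
uniform = np.full((1, 3), 1.0 / 3.0)
print(entropy_regularizer(confident) < entropy_regularizer(uniform))  # -> True
```

For a uniform distribution over K classes the regularizer evaluates to log K, its maximum, which is why driving it down sharpens the predictions.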

6. EXPERIMENTS

6.1. EXPERIMENTS ON SYNTHETIC DATA

Dataset

We conduct experiments on synthetic data generated by the following process. We divide the latent variables into 5 segments, corresponding to 5 domains, with each segment containing 1000 examples. Within each segment, we first sample the mean and the variance of the latent exogenous variables n_c and n_s from the uniform distributions on [1, 2] and [0.3, 1], respectively. Then, for each segment, we generate z_c, z_s, x and y according to the following structural causal model:

z_c := n_c,   z_s := z_c³ + n_s,   y := z_c³,   x := MLP(z_c, z_s),

where, following Khemakhem et al. (2020), we mix the latent z_c and z_s with a multi-layer perceptron to generate x.
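The generative process above can be sketched as follows. This is a simplified reimplementation under our reading of the setup: the random MLP weights, the use of the sampled [0.3, 1] value as a standard deviation, and the dimensions are assumptions, not the paper's exact code.

```python
import numpy as np

rng = np.random.default_rng(0)
N_SEG, N_PER = 5, 1000          # 5 segments (domains), 1000 examples each

# A fixed random 2-layer MLP mixes (z_c, z_s) into the observation x,
# shared across all domains (only the latent distributions change).
w1 = rng.normal(size=(2, 8))
w2 = rng.normal(size=(8, 4))
def mix(z):
    return np.tanh(z @ w1) @ w2

segments = []
for d in range(N_SEG):
    # Per-segment mean and spread for the exogenous variables, as in the text.
    mu = rng.uniform(1, 2, size=2)
    sd = rng.uniform(0.3, 1, size=2)
    n_c = rng.normal(mu[0], sd[0], size=N_PER)
    n_s = rng.normal(mu[1], sd[1], size=N_PER)
    z_c = n_c                 # z_c := n_c
    z_s = z_c ** 3 + n_s      # z_s := z_c^3 + n_s
    y = z_c ** 3              # y   := z_c^3  (invariant mechanism)
    x = mix(np.stack([z_c, z_s], axis=1))
    segments.append((x, y, d))

print(len(segments), segments[0][0].shape)   # -> 5 (1000, 4)
```

Note that only p(n_c), p(n_s) vary with the segment index, while the mechanism from z_c to y is shared, which is exactly the latent covariate shift setting.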

Results

In the implementation, we use the first 4 segments as source domains and the last segment as the target domain.

6.2. EXPERIMENTS ON REAL DATA

Dataset. We further evaluate the proposed iLCC-MSDA on the benchmark domain adaptation datasets PACS (Li et al., 2017) and Terra Incognita (Beery et al., 2018). In the original PACS, the label distributions of any two domains are very similar (i.e., D_KL ≈ 0.1), making the data suitable for domain adaptation with conditional shift as shown in Figure 1(b).

Results. The results of the different methods on PACS are presented in Table 1. We observe that as the KL divergence between label distributions increases, the performance of MCDA, M3DA, LtC-MSDA and T-SVDNet, which are based on learning invariant representations, gradually degenerates; when the KL divergence is about 0.7, their performance is worse than traditional ERM. Compared with IRM, IWCDAN and LaCIM, which allow the label distribution to change across domains, the proposed iLCC-MSDA obtains the best performance, in line with its theoretical support. Table 2 depicts the results of the different methods on the challenging Terra Incognita. The proposed iLCC-MSDA achieves a significant performance gain on the challenging task →L7 and is the only method that is superior to ERM.

Ablation studies. The bottom of Tables 1 and 2 presents the results of the ablation studies. We observe that the entropy regularization in equation 6 significantly increases the performance of the proposed method (by around 10% and 5%) on the two datasets, respectively. This justifies the importance of the causal relation between y and n_c, which is consistent with our model assumption. Besides, the hyper-parameter β also boosts performance by enforcing independence among the latent variables n_i.

7. CONCLUSION

The key for domain adaptation is to understand how the joint distribution of features and labels changes across domains. Previous works usually assume covariate shift or conditional shift to interpret this change, which may be restrictive in some real applications. Hence, this work considers a new and milder assumption, latent covariate shift. Specifically, we propose a latent causal model to precisely formulate the generative process of the input features and label. We show that the latent content variable in the proposed latent causal model can be identified up to scaling. This inspires a new method to learn the invariant label distribution conditional on the latent content variable, resulting in a principled way to guarantee generalization to target domains. Experiments demonstrate the theoretical results and the efficacy of the proposed method compared with state-of-the-art methods across various datasets.

¹ For convenience in the later parts, with a slight abuse of definition, 'independent n_i' means that the n_i are mutually independent conditional on the observed variable D. See Appendix A.3 for detailed assumptions.

We use a 2-layer fully connected network for the prior model, and a 2-layer fully connected network to map n_c to z_c. For hyper-parameters, we set β = 4, γ = 0.1, λ = 1e-4 for the proposed method on all datasets.

A.1 THE PROOF OF PROPOSITION 4.1

To prove non-identifiability, it suffices to show that several different graph structures can lead to the same observed data. In particular, consider the net effect of n_c on x. There are two different paths that can 'explain' this net effect. One path is n_c → z_c → x. In this case, since we place no restriction on the function class of the edges, we can cut the edge z_c → z_s (e.g., the left sub-figure of Figure 4) and obtain the same observed data, as depicted by the right sub-figure of Figure 4.



Figure 1: Illustration of three different assumptions for MSDA. (a) Covariate Shift: p_D(x) changes across domains, while p_D(y|x) is invariant. (b) Conditional Shift: p_D(y) is invariant, while p_D(x|y) changes across domains. (c) Latent Covariate Shift: p_D(z_c) changes across domains, while p_D(y|z_c) is invariant.

Figure 2: The proposed latent causal model.

Figure 4: Two equivalent graph structures.

Figure 5: The proposed iLCC-MSDA to learn the invariant p(y|n c ) for multiple source domain adaptation. C denotes concatenation, and S denotes sampling from the posterior distributions.

Figure 6(a) shows the true and recovered distributions of the exogenous variable n_c. Owing to the support of nonlinear ICA, the proposed iLCC-MSDA obtains a mean correlation coefficient (MCC) of 0.96 between the original n_c and the recovered one. Because the conditional distribution p(y|n_c) is invariant, even as the distribution of the exogenous variable n_c changes across segments as shown in Figure 6(a), the learned p(y|n_c) generalizes to the target segment in a principled way, as depicted by Figure 6(b). Due to limited space, Figure 6(b) shows only 200 samples of the true and predicted y.
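For reference, the MCC metric quoted above can be computed as below. This is a generic sketch under common conventions (absolute Pearson correlations maximized over component permutations, matching identifiability up to scaling and sign); the paper's exact evaluation script may differ.

```python
import itertools
import numpy as np

def mcc(z_true, z_hat):
    """Mean correlation coefficient between true and recovered latents.
    Brute-force over permutations, so intended for small latent dims."""
    d = z_true.shape[1]
    # Cross-correlation block between true components and recovered ones.
    corr = np.abs(np.corrcoef(z_true.T, z_hat.T)[:d, d:])
    return max(
        corr[np.arange(d), list(p)].mean()
        for p in itertools.permutations(range(d))
    )

# Toy check: recovered latents are a permuted, rescaled, sign-flipped copy.
rng = np.random.default_rng(0)
z = rng.normal(size=(500, 2))
z_rec = np.stack([-3.0 * z[:, 1], 0.5 * z[:, 0]], axis=1)
print(round(mcc(z, z_rec), 4))   # -> 1.0
```

An MCC near 1 thus indicates recovery of the latents up to exactly the indeterminacies (scaling and permutation) that the identifiability analysis allows.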

Figure 6: The Result on Synthetic Data.

However, the original PACS, where the label distribution remains essentially unchanged across domains, is not appropriate for the proposed setting where the label distribution changes, as shown in Figure 1(c). Therefore, we randomly resample the original PACS dataset to generate three datasets with changing label distributions: PACS (D_KL = 0.3), PACS (D_KL = 0.5) and PACS (D_KL = 0.7). Here D_KL = 0.3 (0.5, 0.7) denotes that the KL divergence between the label distributions of any two different domains is approximately 0.3 (0.5, 0.7). See the Appendix for details of the label distributions.

Baselines. We compare the proposed method with state-of-the-art methods to verify its effectiveness. In particular, we compare with empirical risk minimization (ERM), MCDA (Saito et al., 2018), M3DA (Peng et al., 2019), LtC-MSDA (Wang et al., 2020), T-SVDNet (Li et al., 2021), IRM (Arjovsky et al., 2019), IWCDAN (Tachet des Combes et al., 2020) and LaCIM (Sun et al., 2021). Among these, MCDA, M3DA, LtC-MSDA and T-SVDNet learn invariant representations for MSDA, while IRM, IWCDAN and LaCIM learn invariant conditional distributions, allowing the label distribution to change. Details of the implementation, including network architectures and hyper-parameter settings, are in the Appendix. All results are averaged over 3 runs and reported with standard deviations.

Figure 9: t-SNE visualizations of the learned features n_c across domains on the →L7 task in Terra Incognita: (a) the L28 domain, (b) the L43 domain, (c) the L46 domain, (d) the L7 domain. The distribution of the feature n_c learned by the proposed method changes across domains, which is very different from previous methods based on learning invariant representations.

Classification results and ablation study on PACS data.

A APPENDIX

Data Details The commonly used datasets for multi-source domain adaptation, such as Digits-Five, Office-Home, PACS, and DomainNet, are not used directly in this work, because for these datasets the label distributions of any two domains are very similar, which is suitable for domain adaptation with conditional shift as shown in Figure 1(b). However, these datasets are not appropriate for the proposed setting where the label distribution changes across domains, as shown in Figure 1(c). Therefore, we resample the original PACS (Li et al., 2017) dataset, which contains 4 domains, Photo, Art painting, Cartoon and Sketch, sharing the same seven categories. The KL divergence between the label distributions of any two domains in the original PACS is very small, around 0.1. To obtain datasets that meet the requirement of the proposed latent covariate shift, we filter the original dataset by re-sampling it, and obtain three new datasets with different KL divergences between label distributions, as depicted by Figure 7. The resampling process simply selects samples from the original PACS dataset at random, so that the label distribution changes across domains. The resulting label distributions are depicted in Figure 7. For Terra Incognita (Beery et al., 2018), the label distribution is long-tailed in each domain, and each domain has a different label distribution, which is naturally applicable to our setting. This work uses the four domains from the original data, L28, L43, L46 and L7, which share the same seven categories: bird, bobcat, empty, opossum, rabbit, raccoon, skunk, as depicted by Figure 8. (Here 'L28' denotes that the image data are collected from location 28.)

Implementation Details For the synthetic data, we used an encoder and a decoder that are each a 3-layer fully connected network with 30 hidden nodes per layer. We also use a 3-layer fully connected network with 30 hidden nodes for the prior model.
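The resampling step described above can be sketched as follows: given target per-class proportions for a domain, subsample each class independently so the empirical label distribution matches them. The helper and proportions below are illustrative, not the exact procedure used to produce the released splits:

```python
import numpy as np

def resample_domain(labels, target_props, rng):
    """Return indices that subsample a domain so its empirical label
    distribution approximately matches `target_props`."""
    labels = np.asarray(labels)
    classes = np.arange(len(target_props))
    counts = np.array([(labels == c).sum() for c in classes])
    # largest dataset size achievable while respecting the target proportions
    n_total = int(np.floor((counts / target_props).min()))
    keep = []
    for c in classes:
        idx = np.flatnonzero(labels == c)
        n_keep = int(round(target_props[c] * n_total))
        keep.extend(rng.choice(idx, size=min(n_keep, len(idx)), replace=False))
    return np.array(keep)

rng = np.random.default_rng(0)
labels = rng.integers(0, 7, size=2000)          # roughly balanced 7-class domain
props = np.array([0.30, 0.20, 0.15, 0.10, 0.10, 0.10, 0.05])
keep = resample_domain(labels, props, rng)
new_dist = np.bincount(labels[keep], minlength=7) / len(keep)
```

Drawing a different `target_props` vector per domain, with the vectors spaced to give the desired pairwise KL divergence, yields splits in the spirit of PACS (D_KL = 0.3/0.5/0.7).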
Since this is an ideal environment to verify the proposed method, for the hyper-parameters we set β = 1 and γ = 0 to remove the heuristic constraints, and we set λ = 1e-2. For the real data, all methods use the same network backbone, a ResNet-18 pre-trained on ImageNet. Since it can be challenging to train a VAE on high-resolution images, we use the features extracted by ResNet-18 as the VAE input. We then use 2-layer fully connected networks as the VAE encoder and decoder.

Assume the following holds: (i) The set {x ∈ X | φ_ε(x) = 0} has measure zero (i.e., has at most a countable number of elements), where φ_ε is the characteristic function of the density p_ε. (ii) The function f in Eq. 10 is bijective. (iii) There exist 2d + 1 distinct points D_0, D_1, ..., D_{2d} such that the matrix [η(D_1) − η(D_0), ..., η(D_{2d}) − η(D_0)] of size 2d × 2d is invertible, where d is the dimension of n and η(D) denotes the vector of natural-parameter coefficients, which depends on β_{i,1} and β_{i,2}. Then the true latent variables n are related to the estimated latent variables n̂ by the relationship n̂ = Pn + c, where P denotes a permutation matrix with scaling and c denotes a constant vector.

Eq. 9 enforces Gaussian distributions on the latent noise variables n. Note that the assumptions of nonlinear ICA (Khemakhem et al., 2020) on the noise allow a broad class of exponential family distributions, e.g., Gaussian, Laplace and Gamma distributions. This work considers the Gaussian distribution mainly because it is what we implement in our experiments, as shown in Eqs. 1 and 2.
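Assumption (iii) can be checked numerically: stack the differences η(D_k) − η(D_0) as columns and test that the resulting 2d × 2d matrix has full rank. The sketch below assumes Gaussian noise, so each η(D) concatenates the per-component coefficients (β_{i,1}, β_{i,2}) into a vector of length 2d; the random parameters are purely illustrative (generic parameters satisfy the condition almost surely):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 3                      # dimension of the latent noise n
# natural parameters eta(D) for 2d + 1 distinct domains D_0, ..., D_{2d};
# for Gaussian noise, eta(D) stacks (beta_{i,1}, beta_{i,2}) for i = 1..d
etas = rng.normal(size=(2 * d + 1, 2 * d))
# columns are eta(D_k) - eta(D_0), k = 1..2d, forming a 2d x 2d matrix
L = (etas[1:] - etas[0]).T
assumption_iii_holds = np.linalg.matrix_rank(L) == 2 * d
print(assumption_iii_holds)
```

Intuitively, the condition asks that the domains vary the noise distribution in "enough directions" of the natural-parameter space, which is what lets the domain index play the role of the auxiliary variable in nonlinear ICA.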

