CAKE: CAUSAL AND COLLABORATIVE PROXY-TASKS LEARNING FOR SEMI-SUPERVISED DOMAIN ADAPTATION

Abstract

Semi-supervised domain adaptation (SSDA) adapts a learner to a new domain by effectively utilizing source domain data and a few labeled target samples. It is a practical yet under-investigated research topic. In this paper, we analyze the SSDA problem from two perspectives that have previously been overlooked, and correspondingly decompose it into two key subproblems: robust domain adaptation (DA) learning and maximal cross-domain data utilization. (i) From a causal theoretical view, a robust DA model should distinguish the invariant "concept" (the key clue to the image label) from the nuisance of confounding factors across domains. To achieve this goal, we propose to generate concept-invariant samples that enable the model to classify samples through causal intervention, yielding improved generalization guarantees; (ii) Based on the robust DA theory, we aim to maximally exploit the rich source domain data and the few labeled target samples to further boost SSDA. Consequently, we propose a collaboratively debiasing learning framework that utilizes two complementary semi-supervised learning (SSL) classifiers to mutually exchange their unbiased knowledge, which helps unleash the potential of source and target domain training data, thereby producing more convincing pseudo-labels. The labels thus obtained facilitate cross-domain feature alignment and in turn improve invariant concept learning. In our experimental study, we show that the proposed model significantly outperforms SOTA methods in terms of effectiveness and generalizability on SSDA datasets.

1. INTRODUCTION

Domain Adaptation (DA) aims to transfer training knowledge to a new domain (target, D = D_T) using labeled data from the original domain (source, D = D_S), which alleviates the poor generalization of learned deep neural networks when the data distribution deviates significantly from the original domain Wang & Deng (2018); You et al. (2019); Tzeng et al. (2017). In the DA community, recent works Saito et al. (2019) have shown that the presence of a few labeled samples from the target domain can significantly boost the performance of deep learning-based models. This observation led to the formulation of Semi-Supervised Domain Adaptation (SSDA), a variant of Unsupervised Domain Adaptation (UDA) Venkateswara et al. (2017) that facilitates model training with rich labels from D_S and a few labeled samples from D_T. Because such additional labels on target data are easy to collect in real-world applications, SSDA renders the adaptation problem more practical and promising than UDA. Broadly, most contemporary approaches Ganin et al. (2016); Jiang et al. (2020); Kim & Kim (2020); Yoon et al. (2022) handle the SSDA task under two domain shift assumptions, where X and Y respectively denote the samples and their corresponding labels: (i) Covariate Shift, P(X|D = D_S) ≠ P(X|D = D_T); (ii) Conditional Shift, P(Y|X, D = D_S) ≠ P(Y|X, D = D_T), i.e., the conditional label distributions differ across domains. Intuitively, one straightforward solution for SSDA is to learn common features to mitigate the domain shift. Further quantitative analyses, however, indicate that a model trained with supervision on a few labeled target samples and the labeled source data ensures only partial cross-domain feature alignment Kim & Kim (2020).
That is, it only aligns the features of labeled target samples and their correlated nearby samples with the corresponding feature clusters in the source domain. To systematically study the SSDA problem, we begin by asking two fundamental questions. Q1: What properties should a robust DA model have? To answer this question, we first present a DA example in Figure 1(a), which shows that the image "style" in D = D_T is drastically different from that in D = D_S. A classifier trained on the source domain may fail to predict correct labels even though the "concept" (e.g., plane) is invariant, with a similar outline. In fact, the minimalist style features, being invariant within the "clipart" domain, act as a critical factor for the trained classifier, which may consequently downplay the concept features simply because they are not as invariant as the style features. Importantly, this observation reveals the fundamental reason behind the two domain shift assumptions, i.e., P(Style = clipart|D = D_S) ≠ P(Style = real|D = D_T). Therefore, a robust DA model needs to distinguish the invariant concept features in X across domains from the changing style. Q2: How can we maximally exploit the target domain supervision for robust SSDA? As discussed, supervised learning on the few target labels cannot guarantee global cross-domain feature alignment, which hurts the model's generalization for invariant learning. A common approach in this few-label setting, semi-supervised learning (SSL), uses a model trained on labeled data to predict convincing pseudo-labels for the unlabeled data. This approach relies on the ideal assumption that the labeled and unlabeled data share the same marginal label distribution over classes. However, Figure 1(b) indicates that these distributions differ both inter-domain and intra-domain, which may result in imperfect label predictions that cause the well-known confirmation bias Arazo et al.
(2020), affecting the model's feature alignment capability. Further, in the SSDA setting we have three sets of data: source domain data, and labeled and unlabeled target domain data. A single model may struggle to generalize to all three sets with their different label distributions. Thus, the premise of better utilizing labeled target samples is to mitigate undesirable bias and reasonably exploit the multiple sets. Summing up, these limitations call for a reexamination of SSDA and its solutions. To alleviate the aforementioned limitations, we propose a framework called CAusal collaborative proxy-tasKs lEarning (CAKE), illustrated in Figure 1(c). In the first step, we formalize the DA task using a causal graph. Leveraging causal tools, we then identify the "style" as the confounder and derive the invariant concepts across domains. In the subsequent steps, we build two classifiers based on the invariant concept to exploit the rich information in cross-domain data for better SSDA. In this way, CAKE explicitly decomposes SSDA into two proxy subroutines, namely the Invariant Concept Learning Proxy (ICL) and the Collaboratively Debiasing Learning Proxy (CDL). In ICL, we identify that the key to robust DA is that the underlying concepts are consistent across domains, while the confounder is the style that prevents the model from learning the invariant concept (C) for accurate DA. Therefore, a robust DA model should be an invariant predictor, P(Y|X̂, D = D_T) = P(Y|X̂, D = D_S), under causal interventions. To this end, we devise a causal factor generator (CFG) that produces concept-invariant samples X̂ with different styles, facilitating the DA model to effectively learn the invariant concept. As such, our ICL may be regarded as an improved version of Invariant Risk Minimization (IRM) Arjovsky et al. (2019) for SSDA, equipping the model with the ability to learn concept features that are invariant to styles.
In CDL, with invariant concept learning as the foundation, we aim to unleash the potential of the three sets of cross-domain data for better SSDA. Specifically, we build two correlated and complementary pseudo-labeling-based semi-supervised learning (SSL) classifiers for D_S and D_T with self-penalization. These two classifiers exchange mutual knowledge to expand the number of "labeled" samples in the target domain, thereby bridging the feature distribution gap. Further, to reduce the confirmation bias learned from the respective labeled data, we adopt Inverse Propensity Weighting (IPW) Glynn & Quinn (2010), which forces the SSL models to pay equal attention to popular and tail samples. Specifically, we use prior knowledge of the marginal distribution to adjust the optimization objective from P(Y|X) to P(X|Y) (maximizing the probability of each x ∈ X given different y ∈ Y) for unbiased learning, mitigating the negative impact of label distribution shift. Consequently, the two subroutines mutually boost each other toward their common goal of better SSDA.

We start by grounding domain adaptation (DA) in a causal framework to illustrate the key challenges of cross-domain generalization. As discussed in the introduction, given data X and labels Y, the main difficulty of DA is that the representation extracted from X is no longer a strong visual cue for the sample label in another domain. To study this issue in depth, we first make the following assumption: the "devil" of the DA problem lies in the style confounders S_C and S_I, which prevent the model from learning the concept-invariant causality X → Y.¹ From the causal theoretical view, such a confounding effect can be eliminated by statistical learning with causal intervention Pearl et al. (2000). Putting these observations together, we now state the main theorem of the paper.

Theorem 1 (Causal Intervention).
Under the causal graph in Figure 2 and Assumption 1, performing interventions on S_C and S_I does not change P(Y|X). Thus, in the DA problem, the causal effect P(Y|do(X), D = D_T)² can be computed as:

P(Y|do(X), D = D_T) = P(Y|do(C, S_C, S_I), D = D_T)    [disentangled variables]
= Σ_{D∈{D_S, D_T}} Σ_{ŝ_C∼S_C} Σ_{ŝ_I∼S_I} P(Y|C, ŝ_C, ŝ_I, D) P(C, ŝ_C, ŝ_I, D)
≈ Σ_{x̂∼X̂} P(Y|X, X̂ = x̂) P(X̂ = x̂),    (2)

where X̂ are the invariant causal factors that share the concepts of X but carry different cross-/intra-domain styles, i.e., invariant concept-aware samples.

¹ While this assumption may not hold in all settings, we believe single-image classification can be approximated by it. More discussion of this assumption is in the appendix.
² P(Y|do(X), D = D_T) uses the do-operator Glymour et al. (2016). Given random variables X, Y, we write P(Y = y|do(X = x)) for the probability of Y = y when we intervene and set X to x.

Realistically, X̂ is often a large set owing to the many possible style combinations. This may hurt the model's computational efficiency under Eq. 2, and such numerous causal factors are hard to obtain. However, it is also non-trivial to manually determine the size of X̂ for studying the deconfounded effect. We therefore employ a compromise solution that reduces X̂ to a small set for causal intervention.
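As a minimal illustration of the approximation in Eq. 2, the deconfounded prediction can be sketched as an average of a classifier's outputs over a small set of concept-preserving, style-intervened variants, assuming a uniform prior over the generated variants. The toy model and the brightness-scaling interventions below are hypothetical stand-ins, not the paper's actual generators:

```python
import numpy as np

def deconfounded_predict(model, x, interventions):
    """Monte-Carlo estimate of P(Y|do(X)): average the class
    probabilities over x and its style-intervened variants x_hat,
    assuming a uniform prior P(X_hat = x_hat) over the variants."""
    variants = [x] + [g(x) for g in interventions]
    probs = np.stack([model(v) for v in variants])  # (k+1, num_classes)
    return probs.mean(axis=0)

def toy_model(x):
    # Hypothetical 3-class classifier: softmax over a fixed map of the input mean.
    logits = np.array([x.mean(), -x.mean(), 0.0])
    e = np.exp(logits - logits.max())
    return e / e.sum()

x = np.ones((4, 4))
# Concept-preserving style interventions (illustrative brightness scalings).
styles = [lambda v: v * 0.5, lambda v: v * 1.5]
p = deconfounded_predict(toy_model, x, styles)  # a valid distribution over 3 classes
```

Averaging over the intervened variants is what makes the prediction insensitive to the style factor that any single variant carries.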

3. CAKE: CAUSAL AND COLLABORATIVE PROXY-TASKS LEARNING

This section describes the CAKE for Semi-Supervised Domain Adaptation (SSDA) based on the studied causal and collaborative learning. We shall present each module and its training strategy.

3.1. PROBLEM FORMULATION

In SSDA, we have access to a set of labeled samples S_l = {(x_sl^(i), y_sl^(i))}_{i=1}^{N_s} drawn i.i.d. from the source domain D_S. The goal of SSDA is to adapt a learner to a target domain D_T, whose training set consists of two parts: an unlabeled set T_u = {x_tu^(i)}_{i=1}^{N_u} and a small labeled set T_l = {(x_tl^(i), y_tl^(i))}_{i=1}^{N_l}. Typically, N_l ≤ N_u and N_l ≪ N_s. We solve the problem by decomposing the SSDA task into two proxy subroutines, Invariant Concept Learning (ICL) and Collaboratively Debiasing Learning (CDL), designed to seek a robust learner M(·; Θ) that performs well on test data from the target domain:

M(·; (Θ_I, Θ_C)) : M_I((Ŝ_l, T̂_l, T̂_u)|(S_l, T_u, T_l); Θ_I)  [ICL proxy subroutine]  ↔  M_C((T̄_p|(S_l, Ŝ_l, T_l, T̂_l, T_u, T̂_u)); Θ_C)  [CDL proxy subroutine]    (3)

where M_I and M_C denote the ICL model parameterized by Θ_I and the CDL model parameterized by Θ_C, respectively. In the ICL proxy, M_I(·; Θ_I) learns the causal factors (Ŝ_l, T̂_l, T̂_u) for D_S and D_T in an unsupervised learning paradigm, aiming to generate the invariant causal factors and use Eq. 2 to remove the confounding effect. In the CDL proxy, we construct two pseudo-labeling-based SSL procedures, (S_l, Ŝ_l) → T_u and (T_l, T̂_l) → T_u, utilizing all available training data to bridge the feature discrepancy under the premise of invariant concept learning.
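To make the notation concrete, a toy instantiation of the three training splits is sketched below; all sizes, dimensions, and the additive offset mimicking covariate shift are illustrative, not values from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical SSDA splits: a large labeled source set S_l, an unlabeled
# target set T_u, and a small labeled target set T_l, respecting the
# typical size relations N_l <= N_u and N_l << N_s.
N_s, N_u, N_l, num_classes, dim = 1000, 400, 12, 4, 16

S_l = (rng.random((N_s, dim)), rng.integers(0, num_classes, N_s))
T_u = rng.random((N_u, dim)) + 0.5                     # shifted marginal (covariate shift)
T_l = (rng.random((N_l, dim)) + 0.5, rng.integers(0, num_classes, N_l))
```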

3.2. INVARIANT CONCEPT LEARNING PROXY

As discussed in Section 2, the key to robust DA is to eliminate the spurious correlations between the styles (S_C and S_I) and the label Y. To tackle this problem, we propose an approximate solution to remove the confounding effect induced by S_C and S_I. In detail, we develop two invariant causal factor generators that produce the causal factors X̂ with C. We then propose the Invariant Concept Learning (ICL) loss function, which forces the backbone (e.g., ResNet-34 He et al. (2016)) to focus on learning concepts that are invariant across a set of domains.

3.2.1. INVARIANT CAUSAL FACTOR GENERATOR

Achieving the invariant concept-aware X̂ is challenging because supervised signals are missing or expensive to obtain. We therefore resort to an unsupervised learning paradigm, designing two causal factor generators, C_fg^C(·) (cross-domain) and C_fg^I(·) (intra-domain), to achieve X̂ for D_S and D_T without relying on supervised signals. Cross-domain Causal Factor. Taking D = D_S as an example, the invariant causal factors of S_l are given by Ŝ_l = {Ŝ_l^t, Ŝ_l^s} = {C_fg^C(S_l), C_fg^I(S_l)}, enabling the source concept to be preserved during the cross-domain conversion process. Considering the large domain discrepancy, we optimize the style-transfer loss as follows:

min_{G_st^k} max_{D_t^k} L_st^k(·; Θ_F) = E_{x_sl∼S_l, x_t∼[T_u;T_l]} [log D_t^k(x_t) + log(1 − D_t^k(G_st^k(x_sl))) + L_cyc^k(x_sl, x_t; Θ_F) + L_idt^k(x_sl, x_t; Θ_F)],
k = argmin_{i∈1,...,N_g} L_st^i(·; Θ_F),    (4)

where [·;·] denotes the union of two inputs and D_t^k is the discriminator that distinguishes whether a latent vector originates from D_T. L_cyc^k and L_idt^k are the cycle and identity losses Zhu et al. (2017), and G_st^k is the k-th C_fg^C. Through min-max adversarial training, the domain style-changing samples are obtained. Intra-domain Causal Factor. We use image augmentations as intra-domain style interventions, e.g., modifying color temperature, brightness, and sharpness. We randomly adjust these image properties as our mapping function to change the intra-domain style for D_S while keeping the concept invariant. Thus, the invariant causal factors Ŝ_l = {Ŝ_l^t, Ŝ_l^s} are produced. Correspondingly, for the target domain, T̂_l and T̂_u are obtained with the same generating strategy.
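The intra-domain causal factor generator described above amounts to concept-preserving photometric augmentation. A minimal numpy sketch, assuming brightness and contrast jitter as the style interventions (color-temperature and sharpness jitter would follow the same pattern):

```python
import numpy as np

def intra_domain_style(x, rng):
    """Concept-preserving intra-domain style intervention: randomly
    jitter contrast (multiplicative, around the image mean) and
    brightness (additive), then clip back to the valid [0, 1] range."""
    contrast = rng.uniform(0.8, 1.2)
    brightness = rng.uniform(-0.2, 0.2)
    mean = x.mean()
    return np.clip((x - mean) * contrast + mean + brightness, 0.0, 1.0)

rng = np.random.default_rng(0)
x = rng.random((8, 8, 3))           # a toy image with values in [0, 1]
x_hat = intra_domain_style(x, rng)  # same "concept", different style
```

Because the jitter is global and photometric, object shapes (the concept) are untouched while low-level style statistics change.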

3.2.2. ICL OPTIMIZATION OBJECTIVE

After obtaining the set of invariant concept-aware samples Ŝ_l for the source domain D_S, the goal of the proposed ICL can be formulated as the following optimization problem:

min_{Θ_I^b, Θ_I^c} L_icl(·; (Θ_I^b, Θ_I^c)) = E_{(x_sl, y_sl)∼[S_l, Ŝ_l^t, Ŝ_l^s]} [L_cls(Φ(x_sl; Θ_I^b), y_sl; Θ_I^c) + λ_ir · L_ir(·; Θ_I^b)],
s.t. Θ_I^b = argmin_{Θ_I^b} Σ_{x_sl∼S_l} ( Σ_{G∈{C,I}} d(Φ(x_sl), Φ(C_fg^G(x_sl))) + d(Φ(C_fg^C(x_sl)), Φ(C_fg^I(x_sl))) ),    (5)

where Θ_I^b and Θ_I^c are the learnable parameters of the backbone and the classifier, respectively. Φ(x_sl; Θ_I^b) is the feature the backbone extracts from x_sl, λ_ir is a trade-off parameter, and d(·,·) is the Euclidean distance between two inputs. L_cls(Φ(x_sl; Θ_I^b), y_sl; Θ_I^c) is the cross-entropy classification loss. To further assess the concept-invariant learning effect, we develop the invariant regularization loss L_ir(·; Θ_I^b) through a regularizer: we feed S_l, Ŝ_l^s, and Ŝ_l^t into the backbone network and explicitly enforce invariant predictions, i.e., KL(P(Y|S_l), P(Y|Ŝ_l^s), P(Y|Ŝ_l^t)) ≤ ε. Such regularization converts to an entropy minimization process McLachlan (1975), encouraging the classifier to focus on the domain-invariant concept and downplay the domain-variant style. The key idea of ICL corresponds to the principle of invariant risk minimization (IRM), which models the data representation for invariant predictor learning. More discussion of IRM and ICL is in the appendix.
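The invariance regularizer L_ir can be illustrated with a small numpy sketch. The paper states a joint KL constraint over the predictions on S_l, Ŝ_l^s, and Ŝ_l^t; below we realize it as a symmetrized pairwise KL between the prediction on an image and the predictions on its style-intervened variants, which is one plausible instantiation rather than the paper's exact form:

```python
import numpy as np

def kl(p, q, eps=1e-12):
    """KL divergence between two discrete distributions."""
    return float(np.sum(p * (np.log(p + eps) - np.log(q + eps))))

def invariant_regularizer(p_orig, p_cross, p_intra):
    """Penalize divergence between the class distributions predicted
    for an image and its cross-/intra-domain style-intervened variants,
    so the classifier must rely on the shared concept, not the style."""
    return (kl(p_orig, p_cross) + kl(p_cross, p_orig)
            + kl(p_orig, p_intra) + kl(p_intra, p_orig))

p = np.array([0.7, 0.2, 0.1])
reg_same = invariant_regularizer(p, p, p)                          # style-invariant model
reg_diff = invariant_regularizer(p, np.array([0.1, 0.2, 0.7]), p)  # style-sensitive model
```

A model whose predictions flip when only the style changes pays a large penalty; a perfectly invariant model pays none.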

3.3. COLLABORATIVELY DEBIASING LEARNING PROXY

After invariant concept-aware sample generation, we obtain Ŝ_l, T̂_l, and T̂_u. Next, we elaborate on how to exploit the advantage of the extra supervised signal from the target domain data T_l over the UDA setting. We introduce the Collaboratively Debiasing Learning (CDL) framework on top of the robust DA setting with causal intervention. Specifically, we construct two SSL models, M_C^s(·; Θ_C^s) w.r.t. {S_l, Ŝ_l, T_u} and M_C^t(·; Θ_C^t) w.r.t. {T_l, T̂_l, T_u}, as two complementary models with the same network architecture, which cooperatively and mutually produce pseudo-labels for each other to optimize their parameters Chen et al. (2011); Qiao et al. (2018). For instance, the pseudo-label ȳ of x_tu ∼ T_u for M_C^s(·; Θ_C^s) is given by:

ȳ = argmax_ỹ (P(ỹ|x_tu; Θ_C^s) > τ_s), where ỹ = M_C^t(x_tu; Θ_C^t) if P(ỹ|x_tu; Θ_C^t) > τ_t,    (6)

where τ_s and τ_t are predefined thresholds for pseudo-label selection. We further elaborate on the two components of CDL, namely a debiasing mechanism and a self-penalization technique. Without loss of generality, we describe them using one of the SSL models, M_C^s(·; Θ_C^s).
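A minimal sketch of the pseudo-label exchange in Eq. 6, where the peer classifier proposes a label and the receiving classifier accepts it only if both confidences clear their thresholds; the probability vectors and thresholds below are illustrative:

```python
import numpy as np

def exchange_pseudo_label(probs_peer, probs_self, tau_peer, tau_self):
    """The peer classifier proposes its top-1 class for an unlabeled
    target sample when its confidence exceeds tau_peer; the receiving
    classifier keeps the proposal only if its own confidence in that
    class also exceeds tau_self. Returns the label index or None."""
    y_tilde = int(np.argmax(probs_peer))
    if probs_peer[y_tilde] > tau_peer and probs_self[y_tilde] > tau_self:
        return y_tilde
    return None

peer = np.array([0.10, 0.85, 0.05])  # proposal from the other SSL model
own  = np.array([0.20, 0.70, 0.10])  # receiving model's own confidence
accepted = exchange_pseudo_label(peer, own, tau_peer=0.8, tau_self=0.5)
rejected = exchange_pseudo_label(np.array([0.5, 0.3, 0.2]), own,
                                 tau_peer=0.8, tau_self=0.5)
```

Requiring agreement from both models is what filters out each model's individual confirmation bias before a pseudo-label enters training.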

3.3.1. CONFIRMATION BIAS ELIMINATING MECHANISM

The ultimate objective of most SSL frameworks is to minimize a risk, defined as the expectation of a particular loss function over a labeled data distribution (X, Y) ∼ S_l Van Engelen & Hoos (2020). The optimization problem thus becomes finding the Θ_C^s that minimizes the SSL risk:

min_{Θ_C^s} R(·; Θ_C^s) = E_{(x_sl, y_sl)∼S_l} [Ent_s((x_sl, y_sl); Θ_C^s)] + E_{x_tu∼T_u} [λ_u · Ent_u(x_tu; Θ_C^s)],
s.t. Θ_C^s = argmax_{Θ_C^s} Σ_{(x_sl, y_sl)∼S_l} log P_s(y_sl|x_sl; Θ_C^s),    (7)

where λ_u is a fixed scalar hyperparameter weighting the unlabeled loss, and Ent_s(·) and Ent_u(·) are the cross-entropy losses for the labeled data S_l and the unlabeled data T_u, respectively.

Proposition 1 (Origin of Confirmation Bias). SSL methods estimate the model parameters Θ_C^s via maximum likelihood estimation on the labeled data (X, Y) ∼ S_l. Thus, the confirmation bias B_c in SSL methods originates from the fully observed instances, namely the labeled data.

Under this proposition, an unbiased SSL learner should be impartial between less popular data (e.g., tail samples X_t) and popular data (e.g., head samples X_h), i.e., P(X_h|Y) = P(X_t|Y). Inspired by inverse propensity weighting Glynn & Quinn (2010), we obtain the following unbiased theorem for SSL.

Theorem 2 (Unbiased SSL Label Propagator). The optimization of the parameters Θ_C^s of the SSL model should pay equal attention to all the labeled data, i.e., turn to maximizing Σ_{x_sl∈S_l} log P(x_sl|y_sl; Θ_C^s) (complete proof in the appendix):

Θ_C^s = argmax_{Θ_C^s} Σ_{(x_sl, y_sl)∼S_l} log P(y_sl|x_sl; Θ_C^s) = argmax_{Θ_C^s} Σ_{x_sl∼S_l} log P(x_sl|y_sl; Θ_C^s) · S_IPW(x_sl, y_sl),
S_IPW(x_sl, y_sl) = Σ_{(x_sl, y_sl)∼S_l} P(y_sl|x_sl; Θ_C^s) / (log P(y_sl|x_sl; Θ_C^s) − log P(y_sl; Θ_C^s)),    (8)

where S_IPW(·) is the inverse probability weighting score. This formula can be understood as using prior knowledge of the marginal distribution P(Y; Θ_C^s) to adjust the optimization objective for unbiased learning.
To make practical use of this result, we estimate P(Y; Θ_C^s, B_s, t) in each mini-batch of size B_s for error backpropagation at iteration t. Notably, we use a distribution-moving strategy over the iterations to reduce the high-variance estimates between adjacent epochs. With bias gradually removed from the training process, the performance gap between classes shrinks, and both popular and rare classes are treated fairly.
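The distribution-moving estimate of P(Y) and its use for debiasing can be sketched as follows. The exact S_IPW score in Theorem 2 involves the model's predictive distribution; this sketch substitutes the simpler 1/P(y) inverse-propensity weight to illustrate the idea, and the momentum value and class layout are illustrative:

```python
import numpy as np

class MarginalEstimator:
    """Distribution-moving (EMA) estimate of the label marginal P(Y):
    per-batch empirical class frequencies are blended into a running
    estimate, damping the variance between adjacent iterations."""
    def __init__(self, num_classes, momentum=0.9):
        self.p = np.full(num_classes, 1.0 / num_classes)
        self.momentum = momentum

    def update(self, batch_labels):
        counts = np.bincount(batch_labels, minlength=len(self.p))
        self.p = self.momentum * self.p + (1 - self.momentum) * counts / counts.sum()

    def ipw_weights(self, labels):
        # Up-weight tail classes so head and tail get equal attention.
        return 1.0 / (len(self.p) * self.p[labels])

est = MarginalEstimator(num_classes=3)
for _ in range(100):                       # class 0 plays the head class
    est.update(np.array([0, 0, 0, 0, 1, 2]))
w = est.ipw_weights(np.array([0, 1]))      # tail weight exceeds head weight
```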

3.3.2. SELF-PENALIZATION OF INDIVIDUAL CLASSIFIER

We also design a self-penalization loss that encourages the SSL model to produce more convincing pseudo-labels for exchanging knowledge with the peer classifier. Here, a negative pseudo-label is the most confident label (top-1) predicted by the network with confidence lower than the threshold τ_s. Since a negative pseudo-label is unlikely to be the correct label, we need to increase the probability of all other classes, i.e., we push the output probability of the negative pseudo-label toward zero. The self-penalization objective is defined as:

min_{Θ_C^s} L_sp(·; Θ_C^s) = E_{(x_tu, ȳ_tu)∼T̂_u} [−1(max P(y_tu|x_tu; Θ_C^s) < τ_s) · ȳ_tu log(1 − P(ȳ_tu|x_tu; Θ_C^s))],    (9)

Such self-penalization encourages the model to generate more faithful pseudo-labels with high confidence scores, and hence improves data utilization for better invariant learning.
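A minimal numpy sketch of the self-penalization objective in Eq. 9 for a single unlabeled sample; confident predictions incur no penalty, while a low-confidence top-1 class is treated as a negative pseudo-label and pushed toward zero probability:

```python
import numpy as np

def self_penalization_loss(probs, tau_s, eps=1e-12):
    """If the top-1 confidence falls below tau_s, treat the top-1
    class as a negative pseudo-label and penalize its probability
    with -log(1 - p), driving it toward zero; otherwise no penalty."""
    top1 = int(np.argmax(probs))
    if probs[top1] >= tau_s:
        return 0.0
    return -float(np.log(1.0 - probs[top1] + eps))

low_conf  = self_penalization_loss(np.array([0.40, 0.35, 0.25]), tau_s=0.9)  # penalized
high_conf = self_penalization_loss(np.array([0.95, 0.03, 0.02]), tau_s=0.9)  # no penalty
```

Suppressing uncertain top-1 classes redistributes probability mass, sharpening future predictions so more of them clear the pseudo-label threshold.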

4.2. EXPERIMENTAL RESULTS AND ANALYSES

Comparison with SOTA Methods. Tables 1 and 7 (in the appendix) summarize the quantitative three-shot results of our framework and the baselines on DomainNet and Office-Home. The one-shot results and analysis are in the supplementary material. In general, irrespective of the adaptation scenario, CAKE achieves the best performance on almost all the metrics compared to SOTA on the two datasets. In particular, CAKE outperforms the other baselines in terms of Mean Accuracy by a large margin (DomainNet: 1.2% ∼ 16.4%, Office-Home: 3.3% ∼ 9.7%, and Office: 3.8% ∼ 12.0%) for the SSDA task. Notably, our baseline, a simplified variant of CAKE without causal intervention and the debiasing operation, also obtains results comparable to SOTA (−3.6%). These results benefit from the carefully designed ICL and CDL proxy subroutines and demonstrate the superiority and generalizability of our proposed model. Individual Effectiveness of Each Component. We conduct an ablation study to illustrate the effect of each component in Table 2, which indicates the following: causal inference is critical to boosting SSDA (Row 5 vs. Row 6), contributing significant improvements of 2.4% and 1.9% on DomainNet and Office-Home, respectively. Meanwhile, Row 1 indicates that the model suffers noticeable performance degradation without the bias-removal mechanism (−1.3% and −1.8%). Furthermore, the results of Rows 3 and 4 respectively show the performance improvements from the Invariant Regularization (L_ir) and Self-penalization (L_sp). In summary, the improvement from either module alone is distinguishable, and combining all the components, our CAKE exhibits steady improvement over the baselines. Maximal Cross-domain Data Utilization. Here, we evaluate the effectiveness of the proposed method's data utilization. Figure 3 shows that CAKE utilizes more unlabeled target data than the baseline and outperforms it by a large accuracy margin; CAKE also produces more convincing pseudo-labels than the baseline.
These pseudo-labels can assist SSDA in performing global domain alignment, decreasing the intra-domain discrepancy for robust invariant concept learning. Apart from visualizing the learning process, we also investigate CAKE's sensitivity to the confidence threshold τ for assigning pseudo-labels. Figure 3(c) empirically suggests an appropriate threshold, τ = 0.5; either increasing or decreasing this value results in performance decay. Moreover, we conducted a cooperation vs. solo ablation, reported in Table 3, that verifies the power of collaborative learning. The detached SSL model performs worse, demonstrating that training two correlated models achieves better adaptation than aligning only one of them, because collaborative learning lets both models learn common knowledge from different domains, which in turn facilitates invariant learning. These observations verify that CAKE can deeply mine the potential of cross-domain data, thereby improving SSDA. Effect of Confirmation Bias Elimination. To build insights on the unbiased SSL in CAKE, we perform an in-depth analysis of the bias-eliminating mechanism in Figure 3(d). In this experiment, we randomly select 10 classes (5 head and 5 tail). The results suggest that CAKE and its variant CAKE (w/ bias) obtain comparable performance on the head classes. However, CAKE (w/ bias) fails to maintain consistent superiority on the tail classes while our approach does (e.g., for tail class 69, CAKE: 46.0%, CAKE (w/ bias): 36.2%). This phenomenon is reasonable, since CAKE maintains unbiasedness for each class-wise sample by maximizing P(X|Y). As the labeled/unlabeled data share the same class distribution, the accuracy on the tail classes can be improved. In contrast, CAKE (w/ bias) focuses more on the head classes, which results in unbalanced performance across categories.
These results empirically verify our theoretical analysis and the robustness of the debiasing mechanism, which provides a reliable solution that guarantees mutual data knowledge exchange between the source and target sides. Effect of the Number of Generators. Across all scenes, the best performance is usually achieved with N_g = 2, except for P → C. This ablation shows that the ICFs X̂ can be learned from a limited set of style-changing samples; appropriately using these ICFs for the deconfounded operation effectively improves SSDA performance. Grad-CAM Results of Causal Intervention. We systematically present the explicit benefits of invariant concept learning (ICL). The baseline focuses on the irrelevant style information S = "cluttered" and therefore predicts the wrong class. On the contrary, CAKE attends to the vital image regions by learning the invariant concept C = "ceiling fan" through the deconfounded mechanism. Cross-domain Feature Alignment. We employ t-SNE Van der Maaten & Hinton (2008) to visualize the feature alignment before/after training for the adaptation scenario D_S = "Real" and D_T = "Clipart" on DomainNet, randomly selecting 1000 samples (50 samples per class). Our invariant concept learning focuses on making X_S and X_T alike. It can be observed that as model optimization progresses, e.g., for C = "teapot", the target features gradually converge toward the target cluster cores, and each target cluster also gradually moves closer to its corresponding source cluster core, showing a cluster-wise feature alignment effect. This provides an intuitive explanation of how CAKE alleviates the domain shift issue.

6. CONCLUSION

We first propose a causal framework to pinpoint the causal effect of disentangled style variables and theoretically explain what characteristics a robust domain adaptation model should have. We then discuss maximal training-data utilization and present a collaboratively debiasing learning framework that exploits the training data to effectively boost SSDA. We believe CAKE serves as a complement to the existing literature and provides new insights for the domain adaptation community.

This is the Appendix for "CAKE: CAusal and collaborative proxy-tasKs lEarning for Semi-Supervised Domain Adaptation". Table 4 summarizes the abbreviations and the symbols used in the main paper.


This appendix is organized as follows:

• Section 7 provides the proofs of the disentangled X̂ causal intervention, invariant risk minimization, and the unbiased eliminating mechanism, plus further discussion of Assumption 1.
• Section 8 provides the method details of the proposed CAKE.
• Section 9 reports more experimental settings: datasets, baselines, implementation details, and the training process of CAKE.
• Section 10 presents additional experiments on DomainNet and Office-Home Venkateswara et al. (2017) to verify the effectiveness of CAKE.
• Section 11 lists the limitations of this paper.

7. PROOF AND DERIVATION

This section derives the disentangled causal intervention and the inverse probability weighting theory underpinning the unbiased eliminating mechanism.

7.1. DISENTANGLED X CAUSAL INTERVENTION

We first provide a brief introduction to the preliminaries of disentangled X̂ causal intervention. Real-world observations, according to physicists, result from a mix of independent physical mechanisms. The same applies to causal inference Higgins et al. (2018): the laws are denoted as disentangled generative factors, such as shape, color, and position. Let X represent the image data; each X can be disentangled into concept C, cross-domain style S_C, and intra-domain style S_I variables that are mutually independent, i.e., a triplet X = (C, S_C, S_I) with C ⊥⊥ S_C ⊥⊥ S_I. Correspondingly, the invariant causal factors of X are given by X̂ (the style-mapping results of X), whose styles S ∈ {S_C, S_I} differ from those of X. Only the concept C is relevant to the true label Y of X, i.e., style changing is concept-preserving. In other words, there is a set of independent causal mechanisms φ: S → X, X̂ generating images from S. To study how we can obtain X̂, we leverage the assumption of disentangled variables based on Higgins' definition of disentangled representation Higgins et al. (2018). We state the definition as follows:

Definition 1 (Group Action on Disentangled Variables). Let G be a group acting on S, i.e., g • s transforms s ∈ S; e.g., the group element "turn domain style, real to clipart" changes the semantics from "real" to "clipart". Suppose there is a direct product decomposition G = g_1 × g_2 × ... × g_q and S = S_1 × S_2 × ... × S_q, where g_i acts on S_i respectively. A feature representation is disentangled if there exists a group G acting on X such that:

• Theorem 3 (Decomposable X). There is a decomposition X = X_1 × X_2 × ... × X_q such that each X_i is fixed by the actions of all g_j with j ≠ i and is affected only by g_i; e.g., changing the "domain style" semantics in S does not affect the "concept" vector in X: P(X_j = concept|g_i • X_i) = P(X_j = concept|X_i).

• Theorem 4 (Equivariant Semantic Changing). ∀g ∈ G, ∀s ∈ S, f(g • s) = g • f(s); e.g., the feature of the changed cross-domain style semantics ("real" to "clipart") in S is equivalent to directly changing the style vector in X from "real" to "clipart": P(g • s|S) = P(g • s|X).

Under Theorem 3, the disentangled representations are obtained via our Assumption 1, i.e., X = (C, S_C, S_I). Compared with the previous definition of feature representation, which is a static mapping, the disentangled representation in Definition 1 is dynamic, as it explicitly incorporates group representation Williams (2002), a homomorphism from a group to group actions on a space, e.g., G → X × X, where it is common to use the feature space X. Moreover, Theorem 4 indicates that performing a semantic-changing group action g (e.g., style changing) on S is equivariant for S and X. Thus, X̂ can be obtained by performing different g ∈ G on X. Next, we introduce the causality that allows computing how an outcome would have changed had some variables taken different values, referred to as causal intervention. As a prerequisite, X̂ should be computed following three computing principles Pearl & Mackenzie (2018):

• Abduction: all the invariant causal factors (X̂ = x̂_1, X̂ = x̂_2, ..., X̂ = x̂_k) are inferred from X through P(X̂|X).
• Action: X̂ = x̂_i is drawn from P(X̂ = x̂_i|D = D_S) or P(X̂ = x̂_i|D = D_T), while the values of the other factors are fixed.
• Prediction: the modified (X̂ = x̂_1, X̂ = x̂_2, ..., X̂ = x̂_k) is fed into the generative process P(X|X̂) to obtain the output.

More details can be found in Glymour et al. (2016). Based on these computing principles, we consider G as an embedded function Besserve et al. (2018), i.e., a continuous injective function with a continuous inverse, which generally holds for convolution-based networks as shown in Puthawala et al. (2020). X̂ is obtained through the generative process G: X → X̂.
X̃, sharing the invariant concept with X, can thus be regarded as the causal factors when taking a causal theoretical view of the domain adaptation problem.

Proof of the Sufficient Condition. Suppose that the representation is fully disentangled w.r.t. G. By Definition 1, there exists a subspace X_i ∈ X affected only by the action of g_i ∈ G. This part aims to prove the following sufficient condition: if g_i intervenes on S_i, the invariant causal factors (ICFs) are faithful when L^i_st is smallest, or when the i-th image augmentation has not changed the concept. For a sample x_S from D = D_S, let g^{-1}(x_S) = S = (S_1, S_2, ..., S_k). We modify the style by changing g_i ∈ G drawn from P(g_i | D = D_T) (cross-domain style) or P(g_i | D = D_S) (intra-domain style). Denote the modified style as Ŝ = (Ŝ_1 × Ŝ_2 × ··· × Ŝ_k), and the sample with style Ŝ as x̂_S. The intervention of g_i on S corresponds to a counterfactual outcome in which S_i is set to Ŝ_i. Now, as g^{-1}(x_S) = S, the counterfactual consistency rule gives g_i(x_S) = x̂_S. As x̂_S is faithful by the Counterfactual Faithfulness theorem Pearl et al. (2000), g_i(x_S) is also faithful, i.e., L^i_st is smallest or the i-th image augmentation has not changed the concept.

7.2. INVARIANT RISK MINIMIZATION

In a seminal work, Arjovsky et al. (2019) consider the setting where data are collected from multiple environments with different distributions and spurious correlations arise from dataset biases. Such spurious correlations confuse the model into building predictions on unrelated correlations rather than true causal relations. IRM estimates invariant, causal variables from multiple environments by regularizing predictors to find a data representation that matches all environments. Let X be the image space, and Z and Y the feature space and classification output space (e.g., the set of all probabilities of belonging to each class); the feature extractor backbone is Φ: X → Z and the classifier is w: Z → Y. Let E_tr be a set of training environments, where each e ∈ E_tr is a set of images. Mathematically, IRM phrases these goals as the constrained optimization problem:

min_{Φ,w} Σ_{e∈E_tr} [ R^e(•; Φ) + λ • ||∇_{w|w=1.0} R^e(w • Φ)||² ]

where the first summand is the ERM term and the second the invariant risk; R^e(•; Φ) is the empirical classification risk (ERM) in environment e; w = 1.0 is a scalar, fixed "dummy" classifier; the gradient-norm penalty measures the optimality of the dummy classifier in each environment e; and λ ∈ [0, +∞) is a regularizer balancing the predictive power against the invariance of the predictor 1 • Φ(x). In the DA problem, IRM can be regarded as the classic ERM term (e.g., classification loss) plus the invariant risk (e.g., discrepancy between conditional distributions over the features) Li et al. (2021a). The invariant risk builds on the Conditional Shift assumption (P(Y|X, D = D_S) ≠ P(Y|X, D = D_T)) to learn the data representation, thereby learning the invariant predictor. However, IRM may not be the true savior for the DA task, which still has two issues: • Covariate Shift. As discussed in Section 1, the marginal feature distributions differ across domains, i.e., P(X|D = D_S) ≠ P(X|D = D_T).
In the SSDA task there are only a few labeled samples in D_T, so it is hard to measure the gap between the marginal feature distributions of the two domains. Thus, an alternative is to address the Covariate Shift by reducing the feature discrepancy across domains. However, IRM is not sufficient to tackle this issue. • Spurious Correlation. The learned global representation of an image still carries noisy style information rather than the fine-grained concept. Using such a global representation may leave confounding style information in the feature space, resulting in inaccurate predictions. Different from IRM in DA, our ICL tackles the aforementioned issues at two points: 1) translating images from the source domain D_S to the target domain D_T via the invariant causal factor generator; 2) eliminating the spurious correlation between style and label by statistical learning with causal intervention. In summary, our proposed CAKE not only addresses the two domain shift issues, but also enforces the model to learn the disentangled invariant concepts across domains, which reasonably boosts SSDA.
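As a concrete reference point for the objective above, the IRMv1 penalty can be sketched for a scalar squared-loss risk, where the gradient w.r.t. the frozen dummy classifier w = 1.0 is available in closed form. This is a toy stand-in for the cross-entropy risk used in the paper, not the paper's implementation:

```python
import numpy as np

def irm_penalty(phi_x, y):
    """Squared gradient norm of the per-environment risk w.r.t. the frozen
    scalar "dummy" classifier w = 1.0 (the IRMv1 penalty, here for a
    squared loss).  R^e(w) = mean((w * phi_x - y)^2), so
    dR/dw at w = 1 is 2 * mean((phi_x - y) * phi_x)."""
    grad = 2.0 * np.mean((phi_x - y) * phi_x)
    return grad ** 2

def irm_objective(envs, lam):
    """ERM term summed over environments plus the lambda-weighted
    invariance penalty; each env is a (features, targets) pair."""
    total = 0.0
    for phi_x, y in envs:
        erm = np.mean((phi_x - y) ** 2)
        total += erm + lam * irm_penalty(phi_x, y)
    return total
```

When the representation already predicts y exactly in every environment, both the ERM term and the penalty vanish, which is exactly the invariance IRM rewards.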

7.3. INVERSE PROBABILITY WEIGHTING THEORY

The main content of this paper indicates that an unbiased SSL learner should be impartial between less popular data (e.g., tail samples X_t) and popular data (e.g., head samples X_h), i.e., P(X_h|Y) = P(X_t|Y). We draw on the inverse propensity weighting theory Glynn & Quinn (2010), which introduces a weight for each training sample via its propensity score, reflecting how likely its label is to be observed (e.g., its popularity). In this way, IPW makes up a pseudo-balanced dataset by duplicating each labeled sample inversely proportionally to its propensity (less popular samples should draw the same attention as popular ones), a more balanced imputation. Thus, taking D = D_S as an example, the optimal parameters Θ̂^s_C of the SSL model M^s_C should pay the same attention to all the labeled data. In general SSL training, Θ^s_C can be optimized via the Inverse Probability Weighting score:

Θ̂^s_C ⟹ S_IPW(•) = log P(y_sl | x_sl; Θ̂^s_C) - log P(y_sl; Θ̂^s_C)

where P(x_sl; Θ̂^s_C) is the sampling probability under an empirical distribution and is thus constant. Hence, from (4) to (5) and (7) to (8) in Eq. 13, P(x_sl; Θ̂^s_C) can be ignored. Therefore, we can turn maximizing arg max_{Θ^s_C} Σ_{(x_sl,y_sl)∼S_l} log P(y_sl | x_sl; Θ^s_C) into maximizing arg max_{Θ^s_C} Σ_{(x_sl,y_sl)∼S_l} log P(x_sl | y_sl; Θ^s_C).

7.4. FURTHER DISCUSSION OF ASSUMPTION 1

We would like to further explain our proposed Assumption 1 (Disentangled Variables) and give intuitions for this assumption on other tasks. • Single classification task. We have already noted that this assumption may not hold in all settings, but we believe that many image settings approximate it. For example, in most single classification tasks, a given image (e.g., a dog on the lawn) has only one label. The annotators tend to focus on the most important region (the dog), which can be regarded as the concept when giving the label. The dog with other colors or appearances can be regarded as the intra-domain style S_I.
The dog in another domain (e.g., the clipart domain) with different backgrounds refers to the cross-domain style S_C. In other words, S_C and S_I are confounders that interfere with the model's prediction of the true label given the image. • Other complex vision tasks. For tasks such as multi-label classification Li et al. (2006), visual question answering Antol et al. (2015) and visual captioning Vinyals et al. (2015), this assumption may not be applicable. For instance, suppose an image depicts "a dog and a cat on the lawn". When the image is classified as a dog, besides the style confounders, the cat is also an extraneous factor, which we call an object confounder. Nevertheless, we also investigated this scenario, and the results systematically show the robustness of the proposed CAKE.

8. METHOD DETAILS

This section presents the method details of the proposed CAKE. Figure 5 illustrates the overview of our CAKE framework, which contains two proxy subroutines: Invariant Concept Learning (ICL) and Collaboratively Debiasing Learning (CDL). Next, we elaborate on the details of the cross-domain style transfer and the bias-eliminating mechanism in practice.

8.1. CROSS-DOMAIN STYLE TRANSFER

The invariant concept samples with cross-domain style transfer are generated by CycleGAN Zhu et al. (2017). We provide the architectures of the generator and discriminator in Table 5.

Loss of Cross-domain Invariant Causal Factor Generator. The CycleGAN technique transforms the images so that source concepts are preserved during the cross-domain conversion. Taking D_S → D_T as an example, we develop N_g CycleGANs, each consisting of three loss parts, to conduct the cross-domain style transfer:

arg min_{Θ_I} L^k_st = arg min_{Θ_I} (λ_adv • L^k_adv + λ_cyc • L^k_cyc + λ_idt • L^k_idt), k = arg min_{i∈{1,...,N_g}} L^i_st

where L^k_adv, L^k_cyc and L^k_idt are the adversarial loss, cycle loss and identity loss, respectively, and λ_adv, λ_cyc and λ_idt are their trade-off parameters. Specifically, they are calculated as follows:

L^k_adv = E_{x_t∼(T_u,T_l)}[log D^k_t(x_t)] + E_{x_sl∼S_l}[log(1 - D^k_t(G^k_st(x_sl)))]
L^k_cyc = E_{x_sl∼S_l}[||G^k_ts(G^k_st(x_sl)) - x_sl||] + E_{x_t∼(T_u,T_l)}[||G^k_st(G^k_ts(x_t)) - x_t||]
L^k_idt = E_{x_sl∼S_l}[||G^k_ts(x_sl) - x_sl||] + E_{x_t∼(T_u,T_l)}[||G^k_st(x_t) - x_t||]

where ||•|| denotes the L1 norm, D^k_t is the discriminator that judges whether a latent vector originates from D_T, and G^k_st and G^k_ts are the k-th D_S → D_T and D_T → D_S cross-domain invariant causal factor generators (ICFGs), respectively.
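The three loss terms can be sketched in a few lines, using toy callables in place of the actual CycleGAN networks G^k_st, G^k_ts and D_t (the real architectures are those of Table 5):

```python
import numpy as np

def l1(a, b):
    """Mean L1 distance between two image batches."""
    return np.mean(np.abs(a - b))

def cyclegan_losses(G_st, G_ts, D_t, x_s, x_t):
    """Sketch of the three loss terms for one ICFG, following Zhu et al. (2017).
    G_st / G_ts are the S->T / T->S generators; D_t returns P(sample is real target).
    Returns (adversarial, cycle-consistency, identity) losses."""
    eps = 1e-8  # numerical guard inside the logs
    l_adv = (np.mean(np.log(D_t(x_t) + eps))
             + np.mean(np.log(1.0 - D_t(G_st(x_s)) + eps)))
    l_cyc = l1(G_ts(G_st(x_s)), x_s) + l1(G_st(G_ts(x_t)), x_t)
    l_idt = l1(G_ts(x_s), x_s) + l1(G_st(x_t), x_t)
    return l_adv, l_cyc, l_idt
```

With identity generators, both the cycle and identity losses are exactly zero, which matches the intuition that these terms only penalize concept-destroying transformations.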

8.2. INTRA-DOMAIN STYLE TRANSFER

For the intra-domain style changing factors, we utilize data augmentations as intra-domain style interventions, e.g., color temperature and sharpness, applied on top of the cross-domain style changing samples. The code of the intra-domain style transfer can be found in our online project. Thus, the invariant causal factors Ŝ_l = {Ŝ^t_l, Ŝ^s_l} are produced. Correspondingly, for the target domain, T̂_l and T̂_u are obtained with the same generating strategy.
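In code, such concept-preserving intra-domain interventions reduce to simple pixel-level transforms. A minimal sketch with brightness and color temperature only (the full augmentation list lives in the online project; these function names are ours):

```python
import numpy as np

def adjust_brightness(img, factor):
    """Scale pixel intensities of an image in [0, 1]; a concept-preserving style change."""
    return np.clip(img * factor, 0.0, 1.0)

def adjust_temperature(img, shift):
    """Warm/cool an RGB image by shifting the red and blue channels in opposite directions."""
    out = img.copy()
    out[..., 0] = np.clip(out[..., 0] + shift, 0.0, 1.0)  # red channel
    out[..., 2] = np.clip(out[..., 2] - shift, 0.0, 1.0)  # blue channel
    return out

def intra_domain_views(img, rng):
    """Sample a few intra-domain style interventions of the same image."""
    return [adjust_brightness(img, rng.uniform(0.7, 1.3)),
            adjust_temperature(img, rng.uniform(-0.1, 0.1))]
```

Because every transform keeps the object geometry untouched, the label of each view is inherited from the original sample by construction.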

8.3. INVARIANT CONCEPT LEARNING

After obtaining a set of invariant concept-aware samples Ŝ_l for the source domain D_S, we design the ICL loss function with two parts:

min_{Θ^b_I, Θ^c_I} L_icl(•; (Θ^b_I, Θ^c_I)) = L_cls + λ_ir • L_ir(•; Θ^b_I)

L_ir(•; Θ^b_I) = Σ_{x_sl∼S_l} ( Σ_{G∈{C,I}} d(Φ(x_sl), Φ(CFG_G(x_sl))) + d(Φ(CFG_C(x_sl)), Φ(CFG_I(x_sl))) )

where Θ^b_I and Θ^c_I are the learnable parameters of the backbone and classifier, respectively; Φ(x_sl; Θ^b_I) is the feature the backbone extracts from x_sl; λ_ir is the trade-off parameter; d(•) is the Euclidean distance between two inputs; and CFG_C and CFG_I denote the cross- and intra-domain style interventions of a sample. L_cls(Φ(x_sl; Θ^b_I), y_sl; Θ^c_I) is the cross-entropy loss for classification. To further assess the concept-invariant learning effect, we develop the invariant regularization loss L_ir(•; Θ^b_I) through a regularizer. Such regularization is converted to an entropy minimization process McLachlan (1975), which encourages the classifier to focus on the domain-invariant concept and downplay the domain-variant style.
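The invariant regularizer can be sketched as follows, assuming a hypothetical feature callable `phi` and the two style-intervened views of one sample (names are illustrative, not from the released code):

```python
import numpy as np

def euclid(a, b):
    """Euclidean distance d(., .) between two feature vectors."""
    return np.linalg.norm(a - b)

def icl_invariant_reg(phi, x, x_cross, x_intra):
    """L_ir sketch for a single sample: pull the features of the cross- and
    intra-domain style variants toward the original feature, and toward
    each other, so only the shared concept survives."""
    f, f_c, f_i = phi(x), phi(x_cross), phi(x_intra)
    return euclid(f, f_c) + euclid(f, f_i) + euclid(f_c, f_i)
```

The regularizer is exactly zero when the backbone already maps all style variants of a sample to the same feature, i.e., when the representation is fully concept-invariant.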

8.4. CONFIRMATION BIAS ELIMINATING MECHANISM

According to the complete proof of the IPW theory in Section 7.3, we aim to find the optimal Θ̂^s_C that maximizes Σ_{(x_sl,y_sl)∼S_l} log P(x_sl|y_sl; Θ^s_C) to implement debiased SSL model learning:

Θ̂^s_C = arg max_{Θ^s_C} Σ_{(x_sl,y_sl)∼S_l} log P(x_sl|y_sl; Θ^s_C) = arg max_{Θ^s_C} Σ_{(x_sl,y_sl)∼S_l} S_IPW(•)

To make practical use of Eq. 20, we estimate P(Y; Θ^s_C, B_s, t) in each mini-batch at training iteration t with batch size B_s for error backpropagation. It is noteworthy that we use a distribution moving strategy over the iterations to reduce the high-variance estimation between adjacent epochs:

Σ_{i=1}^{B_s} ( M_o • P(y^(i)_sl; Θ^s_C, B_s, t - B_s) + (1 - M_o) • P(y^(i)_sl; Θ^s_C, B_s, t) ) → Σ_{i=1}^{B_s} P(y^(i)_sl; Θ^s_C, B_s, t)

where B_s is the batch size, M_o is a momentum coefficient, and P(•) is the re-estimated prior. With the gradual removal of bias from the training process, the performance gap between classes shrinks, and both popular and rare classes can be treated fairly.
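The moving prior estimate and the resulting debiased score can be sketched as follows (a per-batch class-count estimate smoothed by the momentum M_o of the paper's hyper-parameters; the function names are ours):

```python
import numpy as np

def update_prior(prev_prior, batch_labels, num_classes, momentum=0.9):
    """Moving estimate of the class prior P(Y) across mini-batches; the
    momentum M_o damps the high variance of per-batch label counts."""
    counts = np.bincount(batch_labels, minlength=num_classes).astype(float)
    batch_prior = counts / counts.sum()
    return momentum * prev_prior + (1.0 - momentum) * batch_prior

def ipw_score(log_p_y_given_x, prior, labels):
    """S_IPW = log P(y|x) - log P(y): subtracting the log-prior removes the
    head-class advantage, so tail classes are weighted up."""
    return log_p_y_given_x[np.arange(len(labels)), labels] - np.log(prior[labels])
```

Because the prior update is a convex combination of two probability vectors, it always remains a valid distribution, so the log-prior subtraction is well defined at every iteration.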

8.5. OBJECTIVE FUNCTION OF CDL

The full objective function of CDL has three parts: the supervised loss L_s(•; Θ_C), the unsupervised loss L_u(•; Θ_C) and the self-penalization loss L_sp(•; Θ_C):

min_{Θ_C} L_cdl(•; Θ_C) = E[λ_s • L_s(•; Θ_C) + λ_u • L_u(•; Θ_C) + λ_sp • L_sp(•; Θ_C)], Θ_C ∈ {Θ^s_C, Θ^t_C} (28)

where λ_s, λ_u and λ_sp denote the pre-defined hyper-parameters.

9. EXPERIMENTAL SETTINGS

We train CAKE on DomainNet and Office-Home with a standard stochastic gradient descent (SGD) Bottou (2010) optimizer in all experiments. We follow Saito et al. (2019) to replace the last linear layer with a K-way cosine classifier (e.g., K = 126 for DomainNet) and train it at a fixed temperature (0.05 in all our settings). Besides, we use an identical set of hyper-parameters (B=24, M_o=0.9, L_r, τ=0.5, T_max=20,000) across all datasets. We utilize MixMatch Berthelot et al. (2019) as the semi-supervised learning model; the basic loss function for CAKE consists of two cross-entropy loss terms: a supervised loss L_s applied to labeled data and an unsupervised loss L_u. We compare the results of CAKE with a wide range of baselines, including early works and recent SOTA models on this task:


• Baseline is a simplified version of CAKE without causal intervention, bias eliminating mechanism, invariant regularization and self-penalization. 

10. ADDITIONAL EXPERIMENTAL RESULTS

We conducted additional experiments on two datasets from different aspects (i.e., the one-shot setting, larger-shot learning, and t-SNE and Grad-CAM visualization of invariant causal factors) to verify the strength of CAKE.

One-shot Setting. We report the comparison with baselines in the one-shot setting on DomainNet in Table 9 and Office-Home in Table 13. CAKE outperforms the SOTA methods by 1.8% and 2.3% on DomainNet (ResNet-34) and Office-Home (VGG-16), respectively. The performance of CAKE for one-shot learning is better than in the three-shot setting, which suggests that almost the best accuracy is obtained (except for P→R on DomainNet, and A→C and C→A on Office-Home). As shown later, we also employ ResNet-34 as the backbone to compare with the SOTA method CDAC Li et al. (2021b) in Table 15. CDAC's accuracy is much lower than our CAKE's in both the one- and three-shot settings. These observations demonstrate the robustness and generalizability of the proposed CAKE once again.

Larger-shot Learning. We provide 10-, 20- and 50-shot SSDA results on DomainNet (R → C) in Table 16. We randomly select and add additional samples per class from the target domain to the target labeled pool. The implementation details are the same as those of the 1- and 3-shot settings. From this table, CAKE's performance improves with more shots and outperforms the baselines from the 10- to the 50-shot setting, maintaining remarkable results consistently.

IRM vs. ICL. As pointed out in Section 7.2, IRM is not sufficient to ensure reduced discrepancy across domains. To validate this point, we report the experimental results of IRM vs. ICL (Table 14) and further analyses to shed light on it.



Footnotes:
1. Note that any distance measure on distributions can be used in place of the Kullback-Leibler (KL) divergence Van Erven & Harremos (2014).
2. B, M_o, L_r and T_max refer to the batch size, momentum, learning rate and max iteration of the SGD optimizer.
3. The M_I and M_C are orthogonal to other advanced style changing and SSL methods that could boost SSDA further.
4. https://github.com/junyanz/pytorch-CycleGAN-and-pix2pix
5. https://anonymous.4open.science/r/Cake-A1B0
6. http://ai.bu.edu/M3SDA/
7. https://www.hemanthdv.org/officeHomeDataset.html
8. https://github.com/YU1ut/MixMatch-pytorch



Figure 1: (a) Four DA cases ("Clipart" → "Real"). (b) Class-wise distribution of source domain and target domain. (c) A simplified version that indicates how our proposed model facilitates the SSDA.

Figure 2: Causal graph of DA.

Assumption 1 (Disentangled Variables). Data X can be disentangled into concept C, cross-domain style S_C and intra-domain style S_I variables, which are mutually independent, i.e., X = (C, S_C, S_I), where C ⊥⊥ S_C ⊥⊥ S_I. Only the concept C is relevant for the true label Y of X, i.e., style changing is concept-preserving. Under this assumption, we abstract the DA problem into a causal graph (Figure 2). In this figure, D represents the domain (e.g., D_S or D_T), while S_I (e.g., different appearances of a concept within the same domain) and S_C (e.g., different backgrounds of a concept across domains) are the nuisance variables that confound Y; any style change is irrelevant to the true label Y. C is the invariant concept that has a direct causal relationship with Y. Therefore, the causal graph reveals the fundamental reason for the distinguishing issues across domains, i.e., the cross/intra-domain styles serve as confounding variables that influence X → Y:

P(Y|C, D = D_S) = P(Y|C, D = D_T) and P(Y|S, D = D_S) ≠ P(Y|S, D = D_T) ⟹ P(Y|X, D = D_S) ≠ P(Y|X, D = D_T), ∀S ∈ {S_C, S_I}
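The deconfounding encoded in the causal graph can be made concrete with a discrete toy example of the backdoor adjustment, where the style S confounds the observed X → Y association (the probability tables below are illustrative only, not from the paper):

```python
import numpy as np

# Toy discrete world: style S (e.g., "real" vs "clipart") confounds X and Y.
p_s = np.array([0.7, 0.3])                 # P(S)
p_y_given_xs = np.array([[[0.9, 0.1],      # P(Y | X=0, S=0) and P(Y | X=0, S=1)
                          [0.4, 0.6]],
                         [[0.2, 0.8],      # P(Y | X=1, S=0) and P(Y | X=1, S=1)
                          [0.5, 0.5]]])    # shape: (X, S, Y)

def p_y_do_x(x):
    """Backdoor adjustment: P(Y | do(X=x)) = sum_s P(Y | X=x, S=s) * P(S=s).
    Averaging over the style prior cuts the confounding path through S."""
    return np.einsum('sy,s->y', p_y_given_xs[x], p_s)
```

The interventional distribution weights each style stratum by its marginal P(S) rather than by how often the style co-occurs with X, which is exactly what the style-transferred samples approximate for the image classifier.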

DATASET AND SETTING Benchmark Datasets. DomainNet is originally a multi-source domain adaptation benchmark. Following Saito et al. (2019) in its use for SSDA evaluation, we select only 4 domains, namely Real, Clipart, Painting and Sketch (abbr. R, C, P and S), each of which contains images of 126 categories. The Office-Home Venkateswara et al. (2017) benchmark contains 65 classes, with 12 adaptation scenarios constructed from 4 domains (i.e., R: Real world, C: Clipart, A: Art, P: Product). Office Saenko et al. (2010) is a relatively small dataset containing three domains, DSLR, Webcam and Amazon (abbr. D, W and A), with 31 classes.

Comparison of Methods. For quantifying the efficacy of the proposed framework, we compare CAKE with previous SOTA SSDA approaches, including MME Saito et al. (2019), DANN Ganin et al. (2016), BiAT Jiang et al. (2020), APE Kim & Kim (2020), DECOTA Yang et al. (2021), CDAC Li et al. (2021b) and SSSD Yoon et al. (2022). More details of baselines are in the appendix.

(a) and (b) show the comparison between CAKE and the baseline with respect to the top-1 accuracy, accuracy and number of pseudo-labels on DomainNet (Real → Clipart). Subscripts S and T indicate the learner trained on the source domain D_S or the target domain D_T. During the learning iterations, we observe that the accuracy of CAKE increases much faster and more smoothly.

Figure 3: Analysis of cross-domain data utilization and debiasing mechanism of CAKE. (a) and (b) depict the top-1-accuracy and correct pseudo-labels of CAKE and baseline within the first 200K iterations. (c) Cake's sensitivity to pseudo-label threshold τ . (d) demonstrates the class-wise accuracy for head and tail classes in dataset produced by CAKE (w/o)/(w/) confirmation bias.

Figure 4(b) visualizes the most influential parts in prediction, generated with Grad-CAM Selvaraju et al. (2017). It is rather clear that CAKE appropriately captures the invariant part of the concept while CAKE (w/o CI) fails. We also analyze the reason why CAKE performs better in these cases. For instance, the concept C = "ceiling fan" has a complicated background, i.e., style S = "cluttered". Without causal intervention, CAKE (w/o ICL) tends to focus

Figure 4: In-depth analysis of CAKE. (a) is the plot of invariant causal factor number N g against accuracy(%). (b) Grad-CAM results of CAKE and CAKE(w/o ICL). (c) t-SNE plot of features.

Semi-supervised Domain Adaptation. Semi-supervised domain adaptation (SSDA) Saito et al. (2019); Qin et al. (2020); Jiang et al. (2020); Li & Hospedales (2020); Kim & Kim (2020); Li et al. (2021b); Yoon et al. (2022) addresses the domain adaptation problem where some target labels are available. However, these techniques mainly rely on the two domain shift assumptions of Covariate Shift and Conditional Shift to conduct SSDA. Such assumptions present intuitive solutions but lack a solid theoretical explanation for the effectiveness of SSDA, which hinders their further development. Thus we develop CAKE, which decomposes SSDA into two proxy subroutines with causal theoretical support and reveals the fundamental reason behind the two domain shift assumptions.

Invariant Risk Minimization. Recently, the notion of invariant prediction has emerged as an important operational concept in machine learning, called IRM Rosenfeld et al. (2020); Arjovsky et al. (2019). IRM proposes to use group structure to delineate different environments, where the aim is to minimize the classification loss while also ensuring that the conditional variance of the prediction function within each group remains small. In DA, this idea can be pursued by learning classifiers that are robust against domain shifts Li et al. (2021a), but it still suffers from the Covariate Shift issue. Therefore, we propose CAKE, which enforces the model to learn the local disentangled invariant concepts rather than global invariant features across domains, thus facilitating SSDA.

Causality in DA. There are some causality studies in the DA community. Glynn & Quinn (2010) considered domain adaptation where both the distribution of the covariate and the conditional distribution of the target given the covariate change across domains. Gong et al. (2016) considered the case where the target causes the covariate; an appropriate solution is to find conditional transferable components whose conditional distribution given the target is invariant after proper location-scale transformations, and to estimate the target distribution of the target domain. Different from these two causal DA approaches, which deal only with the Conditional Shift issue, we also consider the Covariate Shift, which presents an improved IRM view for SSDA.

In this way, we turn maximizing arg max_{Θ^s_C} Σ_{(x_sl,y_sl)∼S_l} log P(y_sl|x_sl; Θ^s_C) into maximizing arg max_{Θ^s_C} Σ_{(x_sl,y_sl)∼S_l} log P(x_sl|y_sl; Θ^s_C) according to Eq. 13. This formula can be understood as using the prior knowledge of the marginal distribution P(Y; Θ^s_C) to adjust the optimization objective for unbiased learning. Thus, the SSL model M^s_C becomes unbiased with respect to class-wise samples by maximizing Σ_{(x_sl,y_sl)∼S_l} log P(x_sl|y_sl; Θ^s_C), thereby eliminating the undesirable confirmation bias.

Figure 5: Overview of CAKE. In the ICL proxy, M_I(•; Θ_I) first learns the causal factors for D_S and D_T in an unsupervised learning paradigm, aiming to generate the invariant causal factors and use causal intervention to remove the confounding effect. In the CDL aspect, we construct two pseudo-labeling-based SSL techniques, (S_l, Ŝ_l) → T_u and (T_l, T̂_l) → T_u, aiming at utilizing all the available training data to bridge the feature discrepancy under the premise of invariant "concept" learning.

Methods. For quantifying the efficacy of the proposed framework, we compare CAKE with previous SOTA SSDA approaches, including MME Saito et al. (2019), DANN Ganin et al. (2016), BiAT Jiang et al. (2020), APE Kim & Kim (2020), DECOTA Yang et al. (2021), ELP Inoue et al. (2018), CDAC Li et al. (2021b) and SSSD Yoon et al. (2022). We also present a simplified version of CAKE as the baseline.

w.r.t. S_C and S_I: Cross-domain Causal Factor. Ŝ^t_l are generated by N_g GAN-based techniques Creswell et al.

Accuracy(%) comparison on DomainNet under the 3-shot setting using ResNet-34 as the backbone network. A larger score indicates better performance. Acronyms of each model can be found in Section 4.1. We color each row as the best, second best, and third best. M_C(•; Θ_C) = MixMatch Berthelot et al. (2019), used across all datasets.

Ablation study that showcases the impact of individual module.

Results of cooperation vs. solo.

Abbreviations and symbols used in the main paper.





Implementation Details. Our code can be checked at the Anonymous Link. Algorithms 1 and 2 present the pseudocode of the training and inference processes of CAKE. We use the PyTorch Paszke et al. (2019) deep learning framework to conduct all our experiments on 8× V100 GPUs and 8× 2080Ti GPUs. We employ ResNet-34 He et al. (2016) and VGG-16 Simonyan & Zisserman (2014) (we also report the results of ResNet-34 on Office-Home) as the backbone models on DomainNet and Office-Home, respectively.

Complete list of hyper-parameters.

• MME Saito et al. (2019) first proposed to solve SSDA by aligning the features from both domains by means of adversarial learning.
• DANN Ganin et al. (2016) augmented the model with a few standard layers and a new gradient reversal layer, based on features that cannot be discriminated between the source and target domains.
• ELP Inoue et al. (2018) designed a framework with domain transfer and pseudo-labeling to generate instance-level annotations for the target domain.
• BiAT Jiang et al. (2020) devised a bidirectional strategy with an adaptive adversarial model and an entropy-penalized virtual adversarial model to guide the direction of generating adversarial examples.
• APE Kim & Kim (2020) addressed the intra-domain discrepancy issue via attraction, perturbation, and exploration schemas.
• DECOTA Yang et al. (2021) decomposed SSDA into an SSL and a UDA problem as two models to bridge the gap and exchange expertise between the source and target domains.
• CDAC Li et al. (2021b) developed an adversarial adaptive clustering loss to guide the model training towards grouping the features of unlabeled target data into clusters and further performing cluster-wise feature alignment across domains.

Algorithm 1 (sketch):
1 Input: Training data S_l from the source domain D_S, T_l and T_u from the target domain D_T, pre-trained classifiers M_C(•; Θ^s_C) and M_C(•; Θ^t_C) with parameters Θ^s_C and Θ^t_C, respectively;
2 Output: Invariant causal factors Ŝ_l, T̂_l, T̂_u, and the fine-tuned classifier M_C(•; Θ^t_C);
3 Initialization: Randomly initialize the parameters {Θ^k_st};
4 Train the cross-domain ICFGs {G^k_st} with unpaired S_l, T_l and T_u using the cross-domain style transfer loss L_st;
5 Randomly sample a minibatch from {S_l, Ŝ_l} and {T_l, T̂_l}; calculate the ICL losses L_cls and L_ir using Eq. 25; measure the deconfounded effect using Eq. 2;
6 return ŷ_tu ← arg max P(Y|do(X = x_tu), D = D_T).

Accuracy on Office-Home (%) for three-shot setting with 4 domains, using VGG-16. A larger score indicates better performance. We color each row as the best , second best , and third best .

Accuracy(%) comparison on Office under the settings of 1-shot and 3-shot using Alexnet as backbone networks.

Accuracy(%) comparison on DomainNet under the settings of one-shot using Resnet34 as backbone networks.

Accuracy(%) comparison of the UDA and SSDA settings of the proposed CAKE on DomainNet under the three-shot setting, using ResNet-34 as the backbone network (columns: R→C, R→P, P→C, C→S, S→P, R→S, P→R, Mean Accuracy).

Accuracy(%) comparison of one-stage CAKE and baselines on DomainNet under the settings of 3-shot using Resnet34 as backbone networks.

Accuracy on Office-Home (%) for one- and three-shot using ResNet-34. To further evaluate the effectiveness of the proposed ICL proxy, we tested a variety of ablation models: (1) CAKE w/o CDL, (2) CAKE w/o ICL, (3) CAKE w/o IPW. From Table 13, one can observe that CAKE under the UDA setting also obtains results comparable to SSDA models. This table also suggests that ICL and IPW are both useful for boosting SSDA performance, contributing accuracy gains of 2.3 and 2.2 points, respectively. These observations further demonstrate the robustness and generalizability of the proposed ICL and IPW once again.

Results on DomainNet (R → C) in the 10-, 20- and 50-shot settings, using ResNet-34.

Analysis of Class Imbalance. Table 12 summarizes the comparison of the pseudo-labels generated by CAKE and CAKE (w/o IPW) on DomainNet. As reported in Table 12, under imbalanced labeled and unlabeled data, our CAKE can generate more pseudo-labels with higher accuracy, for both tail-class and head-class data. In contrast, CAKE w/o IPW generates few correct pseudo-labels, especially for the tail-class data. These results empirically verify the robustness of the debiasing mechanism, which generates more accurate and balanced results. Such a mechanism provides a reliable solution that guarantees the mutual exchange of data knowledge between the source and target sides.

ICL-based CAKE outperforms IRM-based approaches by a large margin of mean accuracy (DomainNet: IRM +22.3% and LIRR +20.9%; Office-Home: IRM +10.3% and LIRR +7.5%). IRM asserts that it is critical to minimize the


discrepancy between conditional distributions over the features. However, IRM is not sufficient to ensure reduced feature discrepancy across domains, and it makes the model rely more on spurious correlations (style → label). Instead, ICL not only considers the invariant risk, but also models the invariant concept by eliminating the confounding effect of spurious correlations, which further acknowledges the importance of ICL. We additionally use Grad-CAM Selvaraju et al. (2017) to further examine the invariant concept learning. Figure 6 shows that CAKE learns the invariant concept and attends to similar image pixels in these samples (same concept with different styles), e.g., foreground object shape semantics. For instance, in the "Castle" example, the styles of the four ICFs are drastically different from the original "Castle" image. Nevertheless, benefiting from the causal interventions, our CAKE can distinguish the invariant "concept" features across domains and ignore the changes of "style", which yields improved generalization guarantees.

Investigation of the Multi-object Scenario. Figure 9 systematically shows the robustness of our proposed invariant concept learning when there are multiple objects in an image. For instance, in the first case (Real → Clipart) with a dog and a cat, our CAKE attends to the most influential pixels of one concept when the classification result is "dog/cat". However, CAKE (w/o ICL) focuses on both animals, even if one of them is obviously an extraneous factor. The truth is that the irrelevant animal serves as critical context in the prediction process for the trained classifier CAKE (w/o ICL). In other words, this extraneous animal is part of the invariant context around the key concept, which may simply downplay the key concept features. In contrast, our CAKE is trained on a complex and changeable context with an invariant concept, so it can attend to the appropriate and unique concept even if there are two objects in the image.
This provides a more reliable explanation of the proposed assumption once again.

Real → Clipart

Clipart → Real

Generating these ICFs of cross-domain style changing samples requires around 12 days, mainly due to the large scale of the benchmark dataset DomainNet. Although unfavorable, this paper focuses on the importance of causal inference for the SSDA task. We believe that introducing the causal theoretical view into SSDA can provide new insights to the domain adaptation community. Moreover, we found that the ICFs of cross-domain style samples all preserve the invariant concept shape under different styles. We believe that an alternative is to use only simple image augmentation methods (e.g., image transformations of brightness, temperature and sharpness) to generate ICFs to measure the deconfounded effect. As shown in

