INFORMATION-THEORETIC ANALYSIS OF UNSUPERVISED DOMAIN ADAPTATION

Abstract

This paper uses information-theoretic tools to analyze the generalization error in unsupervised domain adaptation (UDA). We present novel upper bounds for two notions of generalization error. The first notion measures the gap between the population risk in the target domain and that in the source domain, and the second measures the gap between the population risk in the target domain and the empirical risk in the source domain. While our bounds for the first kind of error are in line with the traditional analysis and give similar insights, our bounds on the second kind of error are algorithm-dependent, which also provides insights into algorithm design. Specifically, we present two simple techniques for improving generalization in UDA and validate them experimentally.

Notation Unless otherwise noted, a random variable is denoted by a capital letter and its realization by the corresponding lower-case letter. Consider a prediction task with instance space Z = X × Y, where X and Y are the input space and the label (or output) space, respectively. Let F be the hypothesis space of interest, in which each f ∈ F is a function or predictor mapping X to Y. We assume that each hypothesis f ∈ F is parameterized by some weight parameter w in some space W, and we may write f as f_w as needed.

1. INTRODUCTION

This paper focuses on the unsupervised domain adaptation (UDA) task, where the learner is confronted with a source domain and a target domain, and the algorithm is allowed access to a labelled training sample from the source domain and an unlabelled training sample from the target domain. The goal is to find a predictor that performs well on the target domain. A main obstacle in such a task is the discrepancy between the two domains. Recent works (Ben-David et al., 2006; 2010; Mansour et al., 2009; Zhao et al., 2019; Zhang et al., 2019; Shen et al., 2018; Germain et al., 2020; Acuna et al., 2021; Nguyen et al., 2022) have proposed various measures to quantify this discrepancy, either for the UDA setting or for the more general domain generalization task, and many learning algorithms have been proposed. For example, Nguyen et al. (2022) use a (reverse) KL divergence to measure the misalignment of the two domain distributions and, motivated by their generalization bound, design an algorithm that penalizes the KL divergence between the marginal distributions of the two domains in the representation space. Although this "KL guided domain adaptation" algorithm has been demonstrated to outperform many existing marginal-alignment algorithms (Ganin et al., 2016; Sun & Saenko, 2016; Shen et al., 2018; Li et al., 2018), it is not clear whether KL-based alignment of the marginal distributions is adequate for UDA, and, more fundamentally, what role the unlabelled target-domain sample should play in cross-domain generalization. Notably, most UDA algorithms are heuristically designed and intuitively justified, and most existing generalization bounds are algorithm-independent. There thus appears to be significant room for both deeper theoretical understanding and more principled algorithm design.
In this paper, we analyze the generalization ability of hypotheses and learning algorithms for UDA tasks using an information-theoretic framework developed in (Russo & Zou, 2016; Xu & Raginsky, 2017). The foundation of our technique is the Donsker-Varadhan representation of the KL divergence (see Lemma A.1). We present novel upper bounds for two notions of generalization error. The first notion ("population-to-population (PP) generalization error") measures the gap between the population risk in the target domain and that in the source domain for a hypothesis, and the second ("expected empirical-to-population (EP) generalization error") measures the gap between the population risk in the target domain and the empirical risk in the source domain for a learning algorithm. We show that the PP generalization error of every hypothesis is uniformly bounded by a quantity governed by the KL divergence between the two domain distributions, which, under bounded losses, recovers the bound in Nguyen et al. (2022). We then show that this KL term upper-bounds several other measures, including the total variation distance (Ben-David et al., 2006), the Wasserstein distance (Shen et al., 2018) and domain disagreement (Germain et al., 2020). Thus, minimizing the KL divergence forces the minimization of these other discrepancy measures as well. This, together with the ease of minimizing KL (Nguyen et al., 2022), explains the effectiveness of the KL-guided alignment approach. For the expected EP generalization error, we develop several algorithm-dependent generalization bounds. These algorithm-dependent bounds further inspire the design of two new yet simple strategies that can further boost the performance of KL-guided marginal alignment algorithms. Experiments are performed to verify the effectiveness of these strategies.

2. RELATED WORK

Domain Adaptation Many domain adaptation generalization bounds have been developed (Ben-David et al., 2006; 2010; David et al., 2010; Mansour et al., 2009; Shen et al., 2018; Zhang et al., 2019; Germain et al., 2020; Acuna et al., 2021), and various discrepancy measures have been introduced to derive these bounds, including total variation (Ben-David et al., 2006; 2010; David et al., 2010; Mansour et al., 2009), Wasserstein distance (Shen et al., 2018), domain disagreement (Germain et al., 2020), and so on. In particular, the bounds based on H∆H in Ben-David et al. (2010) are restricted to a binary classification setting and assume a deterministic labeling function. Furthermore, Ben-David et al. (2010) also assume the loss to be the L1 distance between the predicted label and the true label (which is bounded). Our bounds work for general supervised learning problems with any labelling mechanism (e.g., stochastic labelling), and we do not require a specific choice of loss (the loss may even be unbounded). Recently, Shui et al. (2020) proposed generalization bounds using the Jensen-Shannon (JS) divergence, which bear a relation to our Corollary 4.2. While other algorithm-dependent bounds have been proposed for different transfer learning settings (e.g., Wang et al. (2019)), they are not directly comparable to ours. For more details on domain adaptation theory, we refer readers to the comprehensive survey of Redko et al. (2020). In addition, the most common methods for domain adaptation align the marginal distributions of the representations between the source and target domains, for example, using an adversarial training mechanism (Ganin et al., 2016; Shen et al., 2018; Acuna et al., 2021) or matching the first two moments of the representation distributions (Sun & Saenko, 2016). There are numerous other domain adaptation algorithms, and we refer readers to (Wilson & Cook, 2020; Zhou et al., 2021; Wang et al., 2021b) for recent advances.
Information-Theoretic Generalization Bounds Information-theoretic analysis is usually used to bound the expected generalization error of supervised learning, where the training and testing data come from the same distribution (Russo & Zou, 2016; 2019; Xu & Raginsky, 2017; Bu et al., 2019; Negrea et al., 2019; Steinke & Zakynthinou, 2020; Rodríguez Gálvez et al., 2021). Exploiting the chain rule of mutual information, these bounds have been successfully applied to characterize the generalization ability of stochastic-gradient-based optimization algorithms (Pensia et al., 2018; Negrea et al., 2019; Haghifam et al., 2020; Wang et al., 2021a; Neu et al., 2021; Wang & Mao, 2022a; b). Recently, this framework has also been used in other learning settings, including meta-learning (Jose & Simeone, 2021a; Jose et al., 2021; Rezazadeh et al., 2021; Chen et al., 2021), semi-supervised learning (He et al., 2021; Aminian et al., 2022) and transfer learning (Wu et al., 2020; Jose & Simeone, 2021a; b; Masiha et al., 2021; Bu et al., 2022). In particular, (Wu et al., 2020; Jose & Simeone, 2021b) consider a problem setup different from ours: their expected generalization error is the gap between the target population risk and a weighted empirical risk combining the source and target empirical risks, while our "EP" error is the gap between the target population risk and the source empirical risk. That is, we focus on the role of the unlabelled target data in cross-domain generalization when the source empirical risk is taken as the training objective, whereas their works assume the existence of labelled target data and study its role in domain adaptation.

Let µ and µ′ be two distributions on Z, unknown to the learner, where µ characterizes the source domain and µ′ characterizes the target domain. We may also write µ as P_Z or P_{XY} and µ′ as P_{Z′} or P_{X′Y′}, which define the random variables Z = (X, Y) and Z′ = (X′, Y′), respectively.
Let S = {Z_i}_{i=1}^n ∼ µ^{⊗n} be a labelled source-domain sample and S′_{X′} = {X′_j}_{j=1}^m ∼ P_{X′}^{⊗m} be an unlabelled target-domain sample. The objective of UDA is to design an algorithm A that takes S and S′_{X′} as input and outputs a weight W ∈ W, giving rise to a predictor f_W ∈ F that "works well" on the target domain. Note that the algorithm A is characterized by a conditional distribution P_{W|S,S′_{X′}}. Let ℓ : Y × Y → R₀⁺ be a loss function. The population risk in the target domain for each w ∈ W is defined as R_{µ′}(w) ≜ E_{Z′}[ℓ(f_w(X′), Y′)], and a good UDA algorithm aims to return a weight w that minimizes this risk. Since µ′ is unknown, one often has recourse to the empirical risk in the source domain, defined as R_S(w) ≜ (1/n) Σ_{i=1}^n ℓ(f_w(X_i), Y_i). Generalization error in this setting measures how well the hypothesis returned by the algorithm generalizes from the source-domain training sample to the unknown target-domain distribution µ′. Taking into account the stochastic nature of the algorithm A, a natural notion of generalization error for UDA can be defined as Err ≜ E_{W,S}[R_{µ′}(W) - R_S(W)] = E_{W,S,S′_{X′}}[R_{µ′}(W) - R_S(W)], where the expectation in the first expression is taken over (W, S) ∼ P_{W|S} × µ^{⊗n} and the expectation in the second is taken over (W, S, S′_{X′}) ∼ P_{W|S,S′_{X′}} × µ^{⊗n} × P_{X′}^{⊗m}. There is another notion of generalization error, more traditional in the domain adaptation literature, defined as the gap between the population risk in the target domain and that in the source domain: Err(w) ≜ R_{µ′}(w) - R_µ(w), where R_µ(w) ≜ E_Z[ℓ(f_w(X), Y)]. It is apparent that Err(w) and Err are related by the following triangle inequality: |R_{µ′}(w) - R_S(w)| ≤ |R_{µ′}(w) - R_µ(w)| + |R_µ(w) - R_S(w)|,
where the second term on the right-hand side is the standard generalization error in the source domain, which can be bounded by classical learning-theoretic tools, e.g., Rademacher complexity (Bartlett & Mendelson, 2002). Thus, bounding Err(w) helps bound Err. This paper studies both notions of generalization error for UDA. Specifically, starting from Section 5, we will mainly use information-theoretic tools to bound Err directly, without going through Err(w). For ease of reference, we refer to Err(w) as the population-to-population (PP) generalization error for w and to Err as the expected empirical-to-population (EP) generalization error. The following definitions are useful.

Definition 3.1 (Disintegrated Mutual Information). Let X, Y and Z be random variables and let z be a realization of Z. The disintegrated mutual information of X and Y given Z = z is I_z(X; Y) ≜ D_KL(P_{X,Y|Z=z} || P_{X|Z=z} P_{Y|Z=z}). Note that the conditional mutual information satisfies I(X; Y|Z) = E_Z[I_Z(X; Y)].

Definition 3.2 (Lautum Information (Palomar & Verdú, 2008)). The lautum information between X and Y is defined as L(X; Y) ≜ D_KL(P_X P_Y || P_{XY}).
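To make the two notions of generalization error concrete, here is a minimal numerical sketch on a hypothetical one-dimensional task (not from the paper): squared loss, linear predictor f_w(x) = wx, source domain X ∼ N(0, 1), shifted target domain X′ ∼ N(1, 1), and labels y = x in both domains. The script estimates R_S(w), R_µ(w) and R_µ′(w) by Monte Carlo and checks the triangle inequality relating the PP and EP gaps.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 1-D task: labels y = x, predictor f_w(x) = w * x, squared loss.
# Source mu: X ~ N(0, 1); target mu': X' ~ N(1, 1) (a simple covariate shift).
def loss(w, x, y):
    return (w * x - y) ** 2

def empirical_risk(w, xs, ys):               # R_S(w) on a finite sample
    return float(np.mean(loss(w, xs, ys)))

def population_risk(w, mean, n_mc=200_000):  # Monte Carlo estimate of R_mu(w)
    x = rng.normal(mean, 1.0, n_mc)
    return empirical_risk(w, x, x)           # labels are y = x

w = 0.9
xs = rng.normal(0.0, 1.0, 50)                # small labelled source sample
R_S   = empirical_risk(w, xs, xs)
R_mu  = population_risk(w, 0.0)              # source population risk
R_mu2 = population_risk(w, 1.0)              # target population risk

pp_err = R_mu2 - R_mu                        # Err(w): population-to-population gap
ep_err = R_mu2 - R_S                         # empirical-to-population gap
# Triangle inequality from the text: |EP| <= |PP| + |R_mu(w) - R_S(w)|
assert abs(ep_err) <= abs(pp_err) + abs(R_mu - R_S) + 1e-9
```

Under this shift the target population risk exceeds the source one, so both gaps are typically positive; the inequality itself holds for any w and any sample.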

4. UPPER BOUNDS FOR PP GENERALIZATION ERROR

We now present some upper bounds for Err(w). The key techniques used in developing these bounds are information-theoretic tools in the style of Lemma A.1. These bounds adopt certain KL divergences to measure the discrepancy between the source and target domains. Notably, some previously established bounds are recovered under weaker conditions. Additionally, we demonstrate that under certain conditions the KL-based bound upper-bounds several other discrepancy measures, and hence minimizing the KL divergence forces the minimization of these other measures as well. We first list some common assumptions on the loss function considered in this paper. Notably, this result can be turned into a generalization upper bound providing guidance for algorithm design, and at the same time it provides a lower bound on the generalization error, highlighting a fundamental difficulty of the learning task. To illustrate this, we present a corollary, while noting that a similar development can be applied to the other bounds presented later in this paper. Suppose that each f_w is expressed as the composition g ∘ h, where h maps X to a representation space T and g maps T to Y. For any given h : X → T, denote by µ_h the distribution on T × Y obtained by pushing forward µ via h, that is, µ_h(t, y) = ∫ δ(t - h(x)) dµ(x, y), where δ is the Dirac measure on T. Similarly, let µ′_h denote the distribution on T × Y obtained by pushing forward µ′ via h.

Corollary 4.1. Suppose that f_w = g ∘ h and that Assumption 2 holds. Then for any w ∈ W, R_µ(w) - √(2R² D_KL(µ′||µ)) ≤ R_{µ′}(w) ≤ R_µ(w) + √(2R² D_KL(µ′_h||µ_h)).

In this result, the lower bound on R_{µ′}(w) indicates a fundamental difficulty of UDA learning: using the same predictor f_w, there is no way for the population risk in the target domain to fall below that in the source domain by more than a constant that depends only on the domain difference.
On the other hand, the upper bound suggests that it is possible to squeeze the gap between the two population risks by choosing an appropriate representation map h; evidently, such a map should attempt to align µ′_h with µ_h, or to align their respective proxies. It is also noteworthy that under Assumption 1 and due to Remark 4.1, Theorem 4.1 implies

Err(w) ≤ (M/√2) √( D_KL(P_{X′}||P_X) + D_KL(P_{Y′|X′}||P_{Y|X}) ). (3)

Applying this result in the representation space T, we see that Eq. (3) recovers the bound in Proposition 1 of Nguyen et al. (2022). Notice that unlike Nguyen et al. (2022), Theorem 4.1 (or Eq. (3)) does not require the loss to be the cross-entropy loss. The symmetrized KL divergence is known as Jeffrey's divergence (Jeffreys, 1946), and in fact Nguyen et al. (2022) penalizes this measure between the source and target distributions in the representation space. Notice that the bounds in Shui et al. (2020) are based on the JS divergence. Since there is a sharp upper bound on the JS divergence in terms of Jeffrey's divergence (Crooks, 2008), minimizing Jeffrey's divergence (in the representation space) simultaneously penalizes the JS divergence. In UDA, since Y′ is completely unavailable to the algorithm A, it is impossible to directly minimize the misalignment of the conditional distributions, i.e., D_KL(P_{Y′|T′}||P_{Y|T}), where T and T′ are the representations of the source and target domains, respectively. A common remedy is to assign pseudo labels to the target data based on a learned source classifier (Liang et al., 2020); however, this may also cause additional issues (Shen et al., 2022). For concreteness, suppose the trained model Q approximates well the true mapping between X and Y on the source domain (i.e., Q_{Y|T} = P_{Y|T}), which is usually the training objective, and let Ŷ′ be the pseudo label of T′ generated by the trained model, i.e., Q_{Ŷ′|T′} = Q_{Y|T}.
Let Q_{T′,Ŷ′} = P_{T′} Q_{Ŷ′|T′}. Then the following holds:

D_KL(P_{T′,Y′}||P_{T,Y}) = E_{P_{T′,Y′}}[ log( (P_{T′,Y′}/Q_{T′,Ŷ′}) · (Q_{T′,Ŷ′}/P_{T,Y}) ) ] = D_KL(P_{T′}||P_T) + D_KL(P_{Y′|T′}||Q_{Ŷ′|T′}). (4)

For a specific t′, if P(Y′=y′|T′=t′) ≠ 0 and Q(Ŷ′=y′|T′=t′) = 0, then the second term on the RHS of Eq. (4), D_KL(P_{Y′|T′}||Q_{Ŷ′|T′}), diverges to infinity. In this case, even when the marginal distributions are perfectly aligned, the overall value of the upper bound is large; thus, incorrect pseudo labels may even have a negative impact on target-domain performance. In fact, the misalignment of the conditional distributions appears to be the main difficulty of UDA (Ben-David et al., 2006; Acuna et al., 2021). The next result suggests that this difficulty may be alleviated when the loss function satisfies the triangle property, namely Assumption 4. It can be verified that this assumption is satisfied by the 0-1 loss; it has also been considered in previous works (Mansour et al., 2009; Shen et al., 2018).

Theorem 4.2. If Assumption 4 holds and ℓ(f_{w′}(X), f_w(X)) is R-subgaussian for any w, w′ ∈ W, then for any w, Err(w) ≤ √(2R² D_KL(P_{X′}||P_X)) + λ*, where λ* = min_{w∈W} R_{µ′}(w) + R_µ(w).

Here λ* measures whether domain adaptation can succeed under oracle knowledge of µ and µ′. In particular, if the hypothesis space is large enough, the minimizer w* of the "joint population risk" R_{µ′}(w) + R_µ(w) may achieve R_{µ′}(w*) = R_µ(w*) = 0, in which case we are likely to generalize well on the target domain, and the KL divergence D_KL(P_{X′}||P_X) between the two X-marginals alone bounds the PP generalization error uniformly for all w ∈ W. This theorem motivates the strategy of penalizing D_KL(P_{T′}||P_T) in the representation space for UDA.
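The pseudo-label failure mode behind Eq. (4) is easy to reproduce numerically. The sketch below (illustrative numbers, not from the paper) computes the conditional KL term at a fixed representation t′ for two hypothetical pseudo-labelling distributions: one with full support, and one that assigns zero mass to a label that the true conditional supports, which makes the term (and hence the bound) infinite.

```python
import numpy as np

def kl(p, q):
    # KL divergence between discrete distributions; infinite when q = 0 where p > 0.
    p, q = np.asarray(p, float), np.asarray(q, float)
    mask = p > 0
    if np.any(q[mask] == 0):
        return np.inf
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

# At a fixed representation t': true conditional P(Y'|T'=t') vs. the
# pseudo-labelling distribution Q(Yhat'|T'=t') (hypothetical numbers).
p_true       = [0.7, 0.2, 0.1]
q_good       = [0.6, 0.3, 0.1]   # imperfect but supported everywhere
q_degenerate = [0.9, 0.1, 0.0]   # zero mass on a label with P > 0

assert kl(p_true, q_good) < 0.1            # finite, small penalty
assert kl(p_true, q_degenerate) == np.inf  # the bound blows up
```

Even with perfectly aligned marginals, a single confidently wrong pseudo label makes the conditional term dominate, matching the discussion above.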
The next theorem suggests that such an approach also penalizes other notions of domain discrepancy, for example the key quantity in the PAC-Bayes type of domain adaptation generalization bounds (Germain et al., 2020), defined as

dis(P_X, P_{X′}) ≜ | E_{W,W′}[E_{X′}[ℓ(f_W(X′), f_{W′}(X′))]] - E_{W,W′}[E_X[ℓ(f_W(X), f_{W′}(X))]] |. (5)

Theorem 4.3. If ℓ(f_{w′}(X), f_w(X)) is R-subgaussian for any f_w, f_{w′} ∈ F, then dis(P_X, P_{X′}) ≤ √(2R² D_KL(P_{X′}||P_X)). Note that unlike Germain et al. (2020), here we do not require the loss function to be the 0-1 loss.

4.2. GENERALIZATION BOUNDS VIA THE LIPSCHITZ CONDITION

We now present generalization bounds for UDA under a Lipschitz continuity assumption on the loss function, where W(·,·) denotes the Wasserstein distance.

Theorem 4.4. If Assumption 3 holds, then Err(w) ≤ βW(µ′, µ).

Theorem 4.4 can be related to the KL-based bounds in the previous section when the Wasserstein distance is defined with respect to the discrete metric d. In this case, and under a bounded loss function (which is also Lipschitz continuous), Theorem 4.4 follows. Moreover, the Wasserstein distance is then equivalent to the total variation, which in turn is connected to the KL divergence via Pinsker's inequality (Polyanskiy & Wu, 2019, Theorem 6.5) and the Bretagnolle-Huber inequality (Bretagnolle & Huber, 1979, Lemma 2.1). Thus, we arrive at the following result.

Corollary 4.3. If Assumption 1 holds and d is the discrete metric, then Err(w) ≤ M·TV(µ′, µ) ≤ M·min{ √((1/2) D_KL(µ′||µ)), √(1 - e^{-D_KL(µ′||µ)}) }.

Note that the results here are inspired by the work of Rodríguez Gálvez et al. (2021). Corollary 4.3 provides a tighter bound than the one in Eq. (3), as can be directly verified. Parallel to Theorem 4.2, if the loss function satisfies the triangle property, we may establish the bound below, which recovers a similar result in (Shen et al., 2018, Theorem 1) without restricting the task to binary classification or requiring the loss to be the L1 distance.

Theorem 4.5. If Assumption 4 holds and ℓ(f_w(X), f_{w′}(X)) is β-Lipschitz in X for any w, w′ ∈ W, then for any w ∈ W, Err(w) ≤ βW(P_{X′}, P_X) + λ*, where λ* = min_{w∈W} R_{µ′}(w) + R_µ(w).

These results justify the strategy of minimizing the domain discrepancy in the representation space. Since the KL-based bounds upper-bound those based on other measures of domain difference, penalizing the KL divergence will also penalize those other measures.
This is practically advantageous since it is usually easier and more stable to minimize the KL divergence (Nguyen et al., 2022) .
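The combination of Pinsker's and the Bretagnolle-Huber inequality in Corollary 4.3 can be checked empirically. The sketch below (hypothetical random discrete distributions) verifies that the total variation never exceeds the smaller of the two KL-based expressions; Bretagnolle-Huber becomes the tighter of the two for large KL, since its expression saturates at 1.

```python
import numpy as np

def kl(p, q):
    p, q = np.asarray(p, float), np.asarray(q, float)
    return float(np.sum(np.where(p > 0, p * np.log(p / np.maximum(q, 1e-300)), 0.0)))

def tv(p, q):
    return 0.5 * float(np.sum(np.abs(np.asarray(p) - np.asarray(q))))

rng = np.random.default_rng(1)
for _ in range(100):
    p = rng.dirichlet(np.ones(5))
    q = rng.dirichlet(np.ones(5))
    d = kl(p, q)
    pinsker = np.sqrt(0.5 * d)            # Pinsker's inequality
    bh      = np.sqrt(1.0 - np.exp(-d))   # Bretagnolle-Huber inequality
    # Corollary 4.3's right-hand side uses the smaller of the two:
    assert tv(p, q) <= min(pinsker, bh) + 1e-12
```

This also illustrates why the corollary improves on a single-inequality bound: neither expression dominates the other over all distribution pairs.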

5. UPPER BOUNDS FOR EP GENERALIZATION ERROR AND APPLICATIONS

There are two limitations in the bounds on the PP generalization error developed so far and in the traditional analysis of UDA. First, such bounds are independent of w and hence algorithm-independent. Second, although these bounds may inspire strategies to exploit the unlabelled target sample, e.g., aligning the source and target distributions in the representation space, they provide only limited insight into the role that the unlabelled target sample plays. Inspired by the works of Negrea et al. (2019) and Rodríguez-Gálvez et al. (2021), we derive upper bounds for the EP generalization error that take better advantage of the dependence of the algorithm's output on the unlabelled target data. Applications of these bounds to the design of learning algorithms are also presented.

5.1. EP GENERALIZATION BOUNDS

Theorem 5.1. Assume that ℓ(f_W(X′), Y′) is R-subgaussian under P_{W,Z′|X′_j=x′_j} for any x′_j ∈ X. Then

|Err| ≤ (1/nm) Σ_{j=1}^m Σ_{i=1}^n E_{X′_j} √(2R² I_{X′_j}(W; Z_i)) + √(2R² D_KL(µ||µ′)).

Remark 5.1. It is worth noting that the unlabelled target data contribute to the first term of the bound, and increasing the amount of source and target data reduces it. Specifically, moving the expectation inside the square root by Jensen's inequality and using Z_i ⊥⊥ X′_j, the chain rule gives I(W; Z_i|X′_j) = I(W, X′_j; Z_i) = I(W; Z_i) + I(X′_j; Z_i|W). The term I(W; Z_i) vanishes as n → ∞, and the term I(X′_j; Z_i|W) vanishes as n, m → ∞. The theorem can be turned into a more practically relevant version in which the KL term is replaced by its representation-space counterpart (following an argument similar to that used for Corollary 4.1). In addition, although larger sample sizes allow a better estimate of that KL term, utilizing pseudo-labels for the estimation may have a negative impact (as discussed in Section 4), which can be amplified by the larger sample size.

Corollary 5.1. Let Assumption 1 hold. Then

|Err| ≤ (M/(√2·nm)) Σ_{j=1}^m Σ_{i=1}^n E_{X′_j} √(min{ I_{X′_j}(W; Z_i), L_{X′_j}(W; Z_i) }) + (M/√2) √(min{ D_KL(µ||µ′), D_KL(µ′||µ) }).

Theorem 5.2. Assume ℓ is Lipschitz in both w ∈ W and z ∈ Z, i.e., |ℓ(f_w(x), y) - ℓ(f_w(x′), y′)| ≤ β d₁(z, z′) for all z, z′ ∈ Z and |ℓ(f_w(x), y) - ℓ(f_{w′}(x), y)| ≤ β′ d₂(w, w′) for all w, w′ ∈ W. Then

|Err| ≤ (β′/nm) Σ_{j=1}^m Σ_{i=1}^n E_{X′_j,Z_i} W(P_{W|Z_i,X′_j}, P_{W|X′_j}) + β W(µ, µ′).

This bound is tighter than that of Theorem 5.1, as indicated by the following corollary.

Corollary 5.2. Let Assumption 1 hold.
Then

Err ≤ (M/nm) Σ_{j=1}^m Σ_{i=1}^n E_{X′_j,Z_i} TV(P_{W|Z_i,X′_j}, P_{W|X′_j}) + M·TV(µ, µ′)
    ≤ (1/nm) Σ_{j=1}^m Σ_{i=1}^n E_{X′_j,Z_i} √((M²/2) D_KL(P_{W|Z_i,X′_j}||P_{W|X′_j})) + √((M²/2) D_KL(µ||µ′)).

Notice that to recover Theorem 5.1 from Corollary 5.2 (under Assumption 1), one can use Jensen's inequality to move the expectation over Z_i inside the square root.

5.2. GRADIENT PENALTY AS A UNIVERSAL REGULARIZER

The algorithm-dependent bound in Theorem 5.1 tells us that one can reduce the EP error by limiting the disintegrated mutual information I_{X′_j}(W; Z_i). In stochastic-gradient-based optimization algorithms, this term can be controlled by penalizing the gradient norm. To see this, consider a "noisy" iterative algorithm for updating W, e.g., SGLD. At each time step t, let Z_{B_t} be the labelled mini-batch from the source domain, let X′_{B_t} be the unlabelled mini-batch from the target domain, and let g(W_{t-1}, Z_{B_t}, X′_{B_t}) be the gradient at time t. The update rule of W is W_t = W_{t-1} - η_t g(W_{t-1}, Z_{B_t}, X′_{B_t}) + N_t, where η_t is the learning rate and N_t ∼ N(0, σ_t² I_d) is isotropic Gaussian noise. Inspired by Pensia et al. (2018), we have the following bound.

Theorem 5.3. Let T be the total number of iterations and let G_t = g(W_{t-1}, Z_{B_t}, X′_{B_t}). Then

|Err| ≤ (R/(√2·n)) √( Σ_{t=1}^T (η_t²/σ_t²) E_{S′_{X′},W_{t-1},S} ||G_t - E_{Z_{B_t}}[G_t]||² ) + √(2R² D_KL(µ||µ′)).

Remark 5.2. Considering a noisy iterative algorithm here merely simplifies the analysis; it is also possible to analyze the original iterative gradient optimization method without injected noise. For example, one can follow the development in (Neu et al., 2021; Wang & Mao, 2022a) to analyze vanilla SGD, in which case some residual terms appear in the bound.

Theorem 5.3 hints that to reduce the generalization error, one can simply restrict the gradient norm at each step (so that ||G_t - E_{Z_{B_t}}[G_t]||² is reduced). This strategy also restricts the distance between the final output W_T and the initialization W_0, effectively shrinking the hypothesis space accessible to the algorithm. We also note that the importance of the gradient penalty has been theoretically justified in the supervised learning setting (Negrea et al., 2019; Haghifam et al., 2020; Smith et al., 2021; Rodríguez-Gálvez et al., 2021; Neu et al., 2021; Wang & Mao, 2022a; b).
Indeed, a gradient penalty can be added to any existing UDA algorithm, and it is simple yet effective in practice. Later we will show that even when the algorithm A does not access any target data, in which case I(W; Z_i|X′_j) reduces to I(W; Z_i) and g(W_{t-1}, Z_{B_t}, X′_{B_t}) becomes g(W_{t-1}, Z_{B_t}), minimizing the empirical loss on the source-domain sample while penalizing the gradient norm still improves performance. Notice that the gradient penalty has been used in standard supervised learning as a regularization technique (Geiping et al., 2022; Jastrzebski et al., 2021). It is also used in Wasserstein-distance-based adversarial adaptation (Gulrajani et al., 2017; Shen et al., 2018), where the motivation is to stabilize training and avoid the vanishing-gradient problem. Here we suggest, with strong theoretical justification, that the gradient penalty is a universal technique for improving generalization in UDA for any gradient-based learning method. Notably, the bound in Theorem 5.3 depends only on the size n of the labelled source sample and does not explicitly depend on m, the size of the unlabelled target sample. With a more careful design, viewing the mutual information I_{X′_j}(W; Z_i) in Theorem 5.1 as the expected KL divergence between a posterior and a prior, it is possible to construct a target-data-dependent prior and derive a tighter bound based on a quantity similar to the "gradient incoherence" of Negrea et al. (2019).
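As a concrete illustration of this section, here is a minimal sketch (all names, constants and the task itself are hypothetical, and the penalty is simplified): a noisy SGLD-style update on a toy logistic-regression source task. The per-step quantity (η_t²/σ_t²)||G_t - E[G_t]||² appearing in the first term of Theorem 5.3 is accumulated, with the full-batch gradient standing in for E_{Z_{B_t}}[G_t]; the gradient penalty is approximated by shrinking large gradients, whereas in practice one would add λ||g||² to the loss and differentiate through it (double backprop).

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical logistic-regression source task; gradient of the mini-batch loss.
def grad(w, X, y):
    p = 1.0 / (1.0 + np.exp(-X @ w))
    return X.T @ (p - y) / len(y)

n, d = 256, 5
X = rng.normal(size=(n, d))
y = (X @ rng.normal(size=d) + 0.3 * rng.normal(size=n) > 0).astype(float)

w = np.zeros(d)
eta, sigma, lam, T = 0.1, 0.01, 0.5, 50
bound_sum = 0.0
for t in range(T):
    idx = rng.choice(n, 32, replace=False)      # mini-batch Z_{B_t}
    g = grad(w, X[idx], y[idx])
    g_full = grad(w, X, y)                      # stand-in for E_{Z_{B_t}}[G_t]
    bound_sum += (eta**2 / sigma**2) * float(np.sum((g - g_full) ** 2))
    # Simplified gradient penalty: shrink the step when ||g|| is large,
    # which caps each step's contribution to the accumulated bound term.
    g = g / (1.0 + lam * np.linalg.norm(g))
    w = w - eta * g + rng.normal(0.0, sigma, d)  # noisy (SGLD-style) update

assert np.isfinite(bound_sum) and bound_sum >= 0.0
```

The accumulated `bound_sum` is (up to the outer constants) the data-dependent part of the Theorem 5.3 bound; tracking it during training gives a rough, per-run indicator of how much the gradient penalty tightens it.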

5.3. CONTROLLING LABEL INFORMATION FOR KL GUIDED MARGINAL ALIGNMENT

Consider instances in the representation space, Z = (T, Y) and Z′ = (T′, Y′). Theorem 5.1 also encourages us to align the distributions of the two domains in the representation space, as argued earlier; the KL-guided marginal alignment algorithm proposed in Nguyen et al. (2022) can then be invoked here. One may notice that Theorem 5.1 uses D_KL(µ||µ′) while Nguyen et al. (2022) use D_KL(µ′||µ); as discussed in Section 4, this inconsistency can be ignored when the loss is bounded (see Corollary 5.1). Most domain adaptation algorithms aim to align the marginal distributions of the two domains in the representation space. However, without access to Y′, it remains unknown whether a UDA algorithm will work well, since we cannot guarantee that the discrepancy between the conditional distributions P_{Y|T} and P_{Y′|T′} will not become too large when we align the marginals. In Nguyen et al. (2022), the authors show that D_KL(P_{Y′|T′}||P_{Y|T}) can be upper-bounded by D_KL(P_{Y′|X′}||P_{Y|X}) if I(X; Y) = I(T; Y), and then argue that penalizing the KL divergence of the marginals is safe. We now argue that in practice the condition I(X; Y) = I(T; Y) can be difficult to satisfy when the cross-entropy loss is used to define the source-domain empirical risk. By the data processing inequality applied to Y - X - T, we know that I(X; Y) ≥ I(T; Y) = H(Y) - H(Y|T); thus, for I(T; Y) to reach its maximum, one must minimize H(Y|T). On the other hand, let Q_{Y|T,W} be the predictive distribution of labels in the source domain generated by the classifier. The expected cross-entropy loss for each Z_i in the representation space is then E_{W,Z_i}[ℓ(f_W(T_i), Y_i)] = E_{Z_i} E_{W|Z_i}[-log Q_{Y_i|T_i,W}], which also decomposes as (Achille & Soatto, 2018; Harutyunyan et al., 2020)

E_{W,Z_i}[ℓ(f_W(T_i), Y_i)] = H(Y_i|T_i) + E_{T_i,W}[D_KL(P_{Y_i|T_i,W}||Q_{Y_i|T_i,W})] - I(W; Y_i|T_i).
(6) Then minimizing the expected cross-entropy loss may not adequately reduce H(Y_i|T_i) but may instead cause I(W; Y_i|T_i) to increase significantly, particularly when the model capacity is large. This can have two negative effects. First, the condition I(X; Y) = I(T; Y) is significantly violated, and D_KL(P_{Y′|T′}||P_{Y|T}) is no longer upper-bounded by D_KL(P_{Y′|X′}||P_{Y|X}); hence, aligning the two marginals alone may not be adequate. Second, a large I(W; Y_i|T_i) indicates that W simply memorizes the labels Y_i, resulting in a form of overfitting that hurts generalization performance. The key take-away from the above analysis is that when aligning the marginals in UDA, controlling the source label information in the weights can be important for achieving good cross-domain generalization. A similar message can be deduced from Theorem 5.1 when it is viewed in the representation space, noting that I_{T′_j}(W; Z_i) = I_{T′_j}(W; T_i) + I_{T′_j}(W; Y_i|T_i). To control label information, Harutyunyan et al. (2020) proposed an approach called LIMIT; however, this method is rather complicated and arguably hard to train in domain adaptation (see Appendix C.8). We now derive a simple alternative strategy for this purpose. Notice that I_{T′_j}(W; Y_i|T_i) ≤ inf_Q E_{T_i}[D_KL(P_{W|Y_i,T_i,T′_j=t′_j} || Q_{W|T_i,T′_j=t′_j})], which is a simple extension of the variational representation of mutual information (Polyanskiy & Wu, 2019, Corollary 3.1). Here Q can be any distribution. Taking P_{W|Y_i,T_i,T′_j=t′_j} = N(W, σ²I_d) and Q_{W|T_i,T′_j=t′_j} = N(W̄, σ̄²I_d), we obtain I_{T′_j}(W; Y_i|T_i) ≤ inf_Q E_{T_i}[D_KL(P_{W|Y_i,T_i,T′_j=t′_j} || Q_{W|T_i,T′_j=t′_j})] ∝ ||W - W̄||². Thus, we may create an auxiliary classifier f_{w̄} that is not allowed access to the real source labels Y.
In each iteration, we use the pseudo labels of the target data (and source data) assigned by f_w to train f_{w̄}, and we add ||W - W̄||² as a regularizer in the training of W. The algorithm is given in the Appendix. Remarkably, the regularizer here resembles the "Projection Norm" designed in Yu et al. (2022) for out-of-distribution generalization.

6. EXPERIMENTS

Datasets We evaluate on RotatedMNIST and Digits. RotatedMNIST contains six domains obtained by rotating MNIST images by 0°, 15°, 30°, 45°, 60° and 75°, respectively. We take the original MNIST dataset (0°) as the source domain and the other five domains as target domains; hence, there are five domain adaptation tasks on RotatedMNIST. Digits consists of three sub-datasets, namely MNIST, USPS (Hull, 1994) and SVHN (Netzer et al., 2011), and the corresponding domain adaptation tasks are MNIST→USPS (M→U), USPS→MNIST (U→M) and SVHN→MNIST (S→M).

Compared Methods Baseline methods are popular marginal-alignment UDA methods, including DANN (Ganin et al., 2016), MMD (Li et al., 2018), CORAL (Sun & Saenko, 2016), WD (Shen et al., 2018) and KL (Nguyen et al., 2022). We also choose ERM as another baseline, in which only the source-domain sample is accessible during training. To verify the strategies inspired by our theory, we first add the gradient penalty to the ERM algorithm (ERM-GP), and we then combine the gradient penalty (GP) and controlling label information (CL) with the recently proposed KL-guided marginal alignment method, denoted KL-GP and KL-CL, respectively.

Implementation Details Most of our implementation is based on the DomainBed suite (Gulrajani & Lopez-Paz, 2021). Other settings exactly follow Nguyen et al. (2022), and the results of the baseline methods are taken from Nguyen et al. (2022). Specifically, each algorithm is run three times, and we report the average performance with error bars. Every dataset has a validation set, and model selection is based on the best performance achieved on the target-domain validation set during training (oracle).
The hyperparameter search procedure is also built upon the implementation in the DomainBed suite. Other details and additional experiments can be found in the Appendix.

Results

From Table 1, we first notice that the gradient penalty allows ERM to perform more comparably to the marginal alignment methods. For example, on RotatedMNIST, ERM-GP outperforms CORAL and performs nearly the same as DANN. On Digits, ERM-GP outperforms WD. When GP and CL are combined with the KL guided algorithm, the performance is further boosted. This supports the discussion in Sections 5.2 and 5.3.

7. CONCLUSION

Although numerous learning techniques have been developed for domain adaptation, significant room remains for deeper theoretical understanding and more principled algorithm design. This paper presents an information-theoretic analysis of unsupervised domain adaptation, in which we study two notions of generalization error and present novel learning bounds. Some of these bounds recover previous KL-based bounds under different conditions and confirm the insights behind learning algorithms that align the source and target distributions in the representation space. Our other bounds are algorithm-dependent and better exploit the unlabelled target data, inspiring simple yet novel schemes for the design of learning algorithms. We demonstrate the effectiveness of these schemes on standard benchmark datasets.

A SOME PREREQUISITE DEFINITIONS AND USEFUL LEMMAS

Definition A.1 (Wasserstein Distance). Let d(·,·) be a metric and let P and Q be probability measures on X. Denote by Γ(P, Q) the set of all couplings of P and Q (i.e. the set of all joint distributions on X × X whose two marginals are P and Q). The Wasserstein distance of order one between P and Q is defined as W(P, Q) ≜ inf_{γ∈Γ(P,Q)} ∫_{X×X} d(x, x′) dγ(x, x′).

Remark A.1. Similar to Rodríguez Gálvez et al. (2021), we mainly focus on the 1-Wasserstein distance, but all the upper bounds based on it also hold for higher-order Wasserstein distances by Hölder's inequality (Cédric, 2008, Remark 6.6).

Definition A.2 (Total Variation). The total variation between two probability measures P and Q is TV(P, Q) ≜ sup_E |P(E) − Q(E)|, where the supremum is over all measurable sets E.

Remark A.2. The total variation equals the Wasserstein distance under the discrete metric (or Hamming distortion) d(x, x′) = 1_{x≠x′}, where 1 is the indicator function (Cédric, 2008, Theorem 6.15).

The key quantity in most information-theoretic generalization bounds is the mutual information between the algorithm's input and output. Specifically, the core technique behind these bounds is the well-known Donsker-Varadhan representation of the KL divergence (Polyanskiy & Wu, 2019, Theorem 3.5).

Lemma A.1 (Donsker and Varadhan's variational formula). Let Q, P be probability measures on Θ. For any bounded measurable function f : Θ → R, we have D_KL(Q||P) = sup_f E_{θ∼Q}[f(θ)] − log E_{θ∼P}[exp f(θ)].

Remark A.3. Motivated by the classic f-divergence, Acuna et al. (2021) proposed a discrepancy measure called the D^φ_H-discrepancy (or D^φ_{h,H}-discrepancy). As the KL divergence belongs to the family of f-divergences and both Acuna et al. (2021) and our work use a variational representation of the divergence, there appears to be a connection between our work (in Section 4) and theirs. However, the variational characterization of the f-divergence used in Acuna et al. (2021) is based on the results of Nguyen et al. (2010), while the Donsker-Varadhan representation of the KL divergence (Lemma A.1) used in our paper cannot be directly obtained from that characterization (Jiao et al., 2017; Agrawal & Horel, 2020). In fact, simply choosing x log x as the conjugate function would give a weaker bound than Lemma A.1. Therefore, while there is some similarity between our results and those of Acuna et al. (2021), our results in Section 4 cannot be directly derived from theirs.

Similar to Xu & Raginsky (2017, Lemma 1.), we need the following lemma as a main tool.

Lemma A.2. Let Q and P be probability measures on Θ. Let θ′ ∼ Q and θ ∼ P. If g(θ) is R-subgaussian, then |E_{θ′∼Q}[g(θ′)] − E_{θ∼P}[g(θ)]| ≤ √(2R² D_KL(Q||P)).

Proof. Let f = t·g for any t ∈ R. By Lemma A.1, we have
D_KL(Q||P) ≥ sup_t E_{θ′∼Q}[t g(θ′)] − log E_{θ∼P}[exp t g(θ)]
= sup_t E_{θ′∼Q}[t g(θ′)] − log E_{θ∼P}[exp t(g(θ) − E_{θ∼P}[g(θ)] + E_{θ∼P}[g(θ)])]
= sup_t E_{θ′∼Q}[t g(θ′)] − E_{θ∼P}[t g(θ)] − log E_{θ∼P}[exp t(g(θ) − E_{θ∼P}[g(θ)])]
≥ sup_t t(E_{θ′∼Q}[g(θ′)] − E_{θ∼P}[g(θ)]) − t²R²/2,
where the last inequality is by the subgaussianity of g(θ). Considering t > 0 and t < 0 separately (t = 0 is trivial), and by the AM-GM inequality (the arithmetic mean is at least the geometric mean), the following is straightforward: |E_{θ′∼Q}[g(θ′)] − E_{θ∼P}[g(θ)]| ≤ √(2R² D_KL(Q||P)). This completes the proof.

The following lemma is the Kantorovich-Rubinstein duality of the Wasserstein distance (Cédric, 2008).

Lemma A.3 (KR duality). For any two distributions P and Q, we have W(P, Q) = sup_{f∈1-Lip(d)} ∫_X f dP − ∫_X f dQ, where the supremum is taken over all 1-Lipschitz functions with respect to the metric d, i.e. |f(x) − f(x′)| ≤ d(x, x′) for any x, x′ ∈ X.

To connect the total variation with the KL divergence, we will use Pinsker's inequality (Polyanskiy & Wu, 2019, Theorem 6.5) and the Bretagnolle-Huber inequality (Bretagnolle & Huber, 1979, Lemma 2.1); for more discussion of these two inequalities we refer readers to Canonne (2022).

[Figure 1: The relationship between the random variables in UDA: S′_{X′} and S feed into W, and F = R_{µ′}(W) − R_S(W).]

Lemma A.4 (Pinsker's inequality). TV(P, Q) ≤ √(D_KL(P||Q)/2).

Lemma A.5 (Bretagnolle-Huber inequality). TV(P, Q) ≤ √(1 − e^{−D_KL(P||Q)}).

Below is the variational formula (or golden formula) of mutual information.

Lemma A.6 (Polyanskiy & Wu (2019, Corollary 3.1.)). For two random variables X and Y, we have I(X; Y) = inf_P E_X[D_KL(Q_{Y|X}||P)], where the infimum is achieved at P = Q_Y.
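As a quick numeric illustration of Lemma A.1, the sketch below (distributions chosen arbitrarily for illustration) checks that the Donsker-Varadhan objective attains the KL divergence at f* = log(dQ/dP) on a finite space, and that other choices of f only give lower bounds:

```python
import math

# Two distributions on a 3-point space (illustrative values).
Q = [0.5, 0.3, 0.2]
P = [0.2, 0.5, 0.3]

def kl(q, p):
    # D_KL(q || p) on a finite space
    return sum(qi * math.log(qi / pi) for qi, pi in zip(q, p))

def dv_value(f, q, p):
    # Donsker-Varadhan objective: E_Q[f] - log E_P[exp f]
    return sum(qi * fi for qi, fi in zip(q, f)) - math.log(
        sum(pi * math.exp(fi) for pi, fi in zip(p, f)))

# The supremum is attained at f* = log(dQ/dP).
f_star = [math.log(qi / pi) for qi, pi in zip(Q, P)]
assert abs(dv_value(f_star, Q, P) - kl(Q, P)) < 1e-12

# Any other f yields a value no larger than the KL divergence.
for f in ([1.0, 0.0, -1.0], [0.3, 0.3, 0.3], [2.0, -2.0, 0.5]):
    assert dv_value(f, Q, P) <= kl(Q, P) + 1e-12
```

This is the mechanism exploited throughout the proofs below: plugging any convenient f into the variational objective yields a valid lower bound on the KL divergence.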

B OMITTED PROOFS AND ADDITIONAL RESULTS IN SECTION 4

B.1 PROOF OF THEOREM 4.1

Proof. Let Q = µ′, P = µ and g = ℓ; Theorem 4.1 then follows directly from Lemma A.2.
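Lemma A.2, the engine of this proof, can be sanity-checked in a case where everything is closed form. The following sketch uses the (standard) facts that for P = N(0, 1), Q = N(m, 1) and g(θ) = θ, g is 1-subgaussian under P and D_KL(Q||P) = m²/2, so the bound √(2R²D_KL) equals |m| = |E_Q[g] − E_P[g]| exactly:

```python
import math

def lemma_a2_bound(R, kl):
    # The right-hand side of Lemma A.2: sqrt(2 R^2 D_KL(Q||P))
    return math.sqrt(2 * R ** 2 * kl)

# P = N(0,1), Q = N(m,1), g(theta) = theta: the lemma is tight here.
for m in (0.5, 1.0, 3.0):
    gap = abs(m)          # |E_Q[g] - E_P[g]| = |m - 0|
    kl_qp = m ** 2 / 2    # KL(N(m,1) || N(0,1))
    assert abs(lemma_a2_bound(1.0, kl_qp) - gap) < 1e-12
```

The Gaussian mean-shift example is thus the extremal case for Lemma A.2, which is why subgaussian constants appear throughout the bounds in Section 4.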

B.2 PROOF OF COROLLARY 4.2

Proof. As discussed in Remark 4.1, when the loss is bounded in [0, M], it is guaranteed to be M/2-subgaussian for any w ∈ W. Then, similar to the proof of Theorem 4.1, let Q = µ, P = µ′, g = ℓ and R = M/2; the following bound holds by Lemma A.2:
Err(w) ≤ (M/√2) √(D_KL(µ||µ′)).
Combining this with Theorem 4.1 and using min{A, B} ≤ (A + B)/2, the remaining part is straightforward:
Err(w) ≤ (M/√2) √(min{D_KL(µ||µ′), D_KL(µ′||µ)}) ≤ (M/2) √(D_KL(µ||µ′) + D_KL(µ′||µ)).
This completes the proof.

B.3 PROOF OF THEOREM 4.2

Proof. Let w* = argmin_{w∈W} E_{Z′}[ℓ(f_w(X′), Y′)] + E_Z[ℓ(f_w(X), Y)]. By Lemma A.1,
D_KL(P_{X′}||P_X) ≥ sup_{t∈R, w∈W} E_{X′}[t ℓ(f_w(X′), f_{w*}(X′))] − log E_X[e^{t ℓ(f_w(X), f_{w*}(X))}].
Recall that ℓ(f_{w′}(X), f_w(X)) is R-subgaussian; by Lemma A.2 (with Q = P_{X′}, P = P_X and g(·) = ℓ(f_{w′}(·), f_w(·))), we have
|E_{X′}[ℓ(f_w(X′), f_{w*}(X′))] − E_X[ℓ(f_w(X), f_{w*}(X))]| ≤ √(2R² D_KL(P_{X′}||P_X)).
For any f_w ∈ F, by the symmetry and triangle property of the loss,
E_{Z′}[ℓ(f_w(X′), Y′)]
≤ E_{X′}[ℓ(f_w(X′), f_{w*}(X′))] + E_{Z′}[ℓ(f_{w*}(X′), Y′)]
≤ E_X[ℓ(f_w(X), f_{w*}(X))] + √(2R² D_KL(P_{X′}||P_X)) + E_{Z′}[ℓ(f_{w*}(X′), Y′)]   (8)
= ∫_x ℓ(f_w(x), f_{w*}(x)) dP_X(x) + √(2R² D_KL(P_{X′}||P_X)) + E_{Z′}[ℓ(f_{w*}(X′), Y′)]
= ∫_x ∫_y ℓ(f_w(x), f_{w*}(x)) dP_{Y|X=x}(y) dP_X(x) + √(2R² D_KL(P_{X′}||P_X)) + E_{Z′}[ℓ(f_{w*}(X′), Y′)]
≤ ∫_x ∫_y [ℓ(f_w(x), y) + ℓ(y, f_{w*}(x))] dP_{Y|X=x}(y) dP_X(x) + √(2R² D_KL(P_{X′}||P_X)) + E_{Z′}[ℓ(f_{w*}(X′), Y′)]   (9)
= E_Z[ℓ(f_w(X), Y)] + E_Z[ℓ(Y, f_{w*}(X))] + √(2R² D_KL(P_{X′}||P_X)) + E_{Z′}[ℓ(f_{w*}(X′), Y′)],
where Eq. (8) is by Eq. (7) and Eq. (9) is again by the triangle property of the loss function. Thus, Err(w) ≤ √(2R² D_KL(P_{X′}||P_X)) + λ*, which completes the proof.
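The two bounds in Corollary 4.2 can be checked numerically on a small discrete instance space; the distributions and loss values below are made up for illustration, with M = 1:

```python
import math

# Source (mu) and target (mup) distributions over 4 points, loss in [0, M].
mu   = [0.4, 0.3, 0.2, 0.1]
mup  = [0.1, 0.2, 0.3, 0.4]
loss = [0.0, 0.25, 0.5, 1.0]   # ell evaluated at each point
M = 1.0

def kl(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

# Err(w) = |E_{mu'}[ell] - E_{mu}[ell]|
err = abs(sum(l * p for l, p in zip(loss, mup))
          - sum(l * p for l, p in zip(loss, mu)))
bound_min = (M / math.sqrt(2)) * math.sqrt(min(kl(mu, mup), kl(mup, mu)))
bound_sym = (M / 2) * math.sqrt(kl(mu, mup) + kl(mup, mu))
assert err <= bound_min <= bound_sym + 1e-12
```

The chain err ≤ bound_min ≤ bound_sym mirrors the two inequalities in the corollary (the second following from min{A, B} ≤ (A + B)/2 inside the square root).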

B.4 PROOF OF THEOREM 4.3

Proof. By Lemma A.1,
D_KL(P_{X′}||P_X) ≥ sup_{t∈R, (w,w′)∈W²} E_{X′}[t ℓ(f_w(X′), f_{w′}(X′))] − log E_X[e^{t ℓ(f_w(X), f_{w′}(X))}]
≥ sup_{t∈R} E_{W,W′}[E_{X′}[t ℓ(f_W(X′), f_{W′}(X′))] − log E_X[e^{t ℓ(f_W(X), f_{W′}(X))}]]
≥ sup_{t∈R} E_{W,W′}[E_{X′}[t ℓ(f_W(X′), f_{W′}(X′))]] − log E_{W,W′}[E_X[e^{t ℓ(f_W(X), f_{W′}(X))}]],
where the last inequality is by applying Jensen's inequality to the logarithm, which is concave. By the subgaussian assumption,
|E_{W,W′}[E_{X′}[ℓ(f_W(X′), f_{W′}(X′))]] − E_{W,W′}[E_X[ℓ(f_W(X), f_{W′}(X))]]| ≤ √(2R² D_KL(P_{X′}||P_X)).
This concludes the proof.

B.5 PROOF OF THEOREM 4.4

Proof. From the definition, we have Err(w) = |E_{Z′}[ℓ(f_w(X′), Y′)] − E_Z[ℓ(f_w(X), Y)]| ≤ β W(µ, µ′), where the inequality is by the KR duality of the Wasserstein distance (see Lemma A.3).

B.6 PROOF OF COROLLARY 4.3

Proof. When d is the discrete metric, the Wasserstein distance equals the total variation; then by Theorem 4.4, Err(w) ≤ β TV(µ′, µ). The remaining part is by Lemma A.4 and Lemma A.5:
β TV(µ′, µ) ≤ β min{√(D_KL(µ′||µ)/2), √(1 − e^{−D_KL(µ′||µ)})}.
If ℓ is bounded by M, we can replace β by M above, which completes the proof.

B.7 PROOF OF THEOREM 4.5

Proof. Let w* = argmin_{w∈W} E_{Z′}[ℓ(f_w(X′), Y′)] + E_Z[ℓ(f_w(X), Y)]. If ℓ(f_w(X), f_{w′}(X)) is β-Lipschitz in X for any w, w′ ∈ W, then, similar to Theorem 4.4, it is easy to show that
E_{X′}[ℓ(f_w(X′), f_{w*}(X′))] − E_X[ℓ(f_w(X), f_{w*}(X))] ≤ β W(P_{X′}, P_X).   (10)
For any f_w ∈ F, by the symmetry and triangle property of the loss,
E_{Z′}[ℓ(f_w(X′), Y′)] ≤ E_{X′}[ℓ(f_w(X′), f_{w*}(X′))] + E_{Z′}[ℓ(f_{w*}(X′), Y′)]
≤ E_X[ℓ(f_w(X), f_{w*}(X))] + β W(P_{X′}, P_X) + E_{Z′}[ℓ(f_{w*}(X′), Y′)]   (11)
≤ E_Z[ℓ(f_w(X), Y)] + E_Z[ℓ(Y, f_{w*}(X))] + β W(P_{X′}, P_X) + E_{Z′}[ℓ(f_{w*}(X′), Y′)],
where Eq. (11) is by Eq. (10) and the last inequality is again by the triangle property of the loss function. This completes the proof.
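The min in Corollary 4.3 is not redundant: as the following sketch illustrates, Pinsker's inequality gives the tighter TV bound for small KL, while the Bretagnolle-Huber bound stays below 1 and wins once the KL divergence is large:

```python
import math

def pinsker(kl):
    # Pinsker's TV bound: sqrt(KL / 2), unbounded as KL grows
    return math.sqrt(kl / 2)

def bretagnolle_huber(kl):
    # Bretagnolle-Huber TV bound: sqrt(1 - exp(-KL)), always < 1
    return math.sqrt(1 - math.exp(-kl))

assert pinsker(0.1) < bretagnolle_huber(0.1)   # small KL: Pinsker tighter
assert bretagnolle_huber(10.0) < pinsker(10.0) # large KL: BH tighter
assert bretagnolle_huber(10.0) <= 1.0          # BH never exceeds the trivial TV bound
```

This is why the corollary keeps both terms inside the min rather than committing to either inequality.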

B.8 ADDITIONAL RESULTS: SAMPLE COMPLEXITY BOUNDS

One of the main ingredients in deriving our sample complexity bound is the following lemma, which gives a concentration bound for a class of unbounded functions.

Lemma B.1 (Cortes et al. (2019, Corollary 9)). Let κ > 2 and G = {g : Z → R s.t. E_µ[e^{g(Z)}] < +∞}. Assume E_µ[g(Z)^κ] < +∞ for all g ∈ G. Let µ̂ be the empirical distribution consisting of n data points sampled i.i.d. from µ. If G has finite pseudo-dimension d, then for any δ ∈ (0, 1), the following inequality holds for all g ∈ G with probability at least 1 − δ:
E_µ[g(Z)] ≤ E_µ̂[g(Z)] + 2Λ(κ) (E_µ[g(Z)^κ])^{1/κ} √((1/n)(d log(2en/d) + log(4/δ))),
where Λ(κ) is a constant depending only on κ (see Cortes et al. (2019)).

Below is another useful lemma for the bounded case, which comes from Mohri et al. (2018, Theorem 11.8) with a slight modification (by invoking a different VC-dimension-based generalization bound from Vapnik (1998)).

Lemma B.2. Let F = {f : Z → R+}. Assume E_µ[f(Z)] < M for all f ∈ F and some constant M > 0. Let µ̂ be the empirical distribution consisting of n data points sampled i.i.d. from µ. If F has finite pseudo-dimension d, then for any δ ∈ (0, 1), the following inequality holds for all f ∈ F with probability at least 1 − δ:
E_µ[f(Z)] ≤ E_µ̂[f(Z)] + 2M √((1/n)(d log(2en/d) + log(4/δ))).

We are now in a position to state our sample complexity bound.

Theorem B.1. Let µ̂ and µ̂′ be the empirical distributions consisting of n source data and m target data sampled i.i.d. from µ and µ′, respectively. Let G = {g : Z → R s.t. E_µ[e^{g(Z)}] < ∞} have finite pseudo-dimension d_1, and let the pseudo-dimension of {exp ∘ g | g ∈ G} be d_2. Let κ > 2 and assume that E_µ[g(Z)^κ] < +∞ for all g ∈ G. Assume there exists a constant α ≤ min_{g∈G} {E_µ̂[e^{g(Z)}], E_µ[e^{g(Z)}]}. Then for any δ ∈ (0, 1), the following bound holds with probability at least 1 − δ:
D_KL(µ′||µ) − D_KL(µ̂′||µ̂) ≤ C_1(κ) √((1/n)(d_1 log(2en/d_1) + log(4/δ))) + C_2(α) √((1/m)(d_2 log(2em/d_2) + log(4/δ))).
Here C_1(κ) = 2Λ(κ) sup_{g∈G} (E_µ[g(Z)^κ])^{1/κ} with Λ(κ) as in Lemma B.1, and C_2(α) = (2/α) sup_{g∈G} E_µ[e^{g(Z)}].

Proof. Recall Lemma A.1; we have
D_KL(µ′||µ) = sup_{g∈G} E_{µ′}[g(Z′)] − log E_µ[e^{g(Z)}], and
D_KL(µ̂′||µ̂) = sup_{g∈G} E_{µ̂′}[g(Z′)] − log E_µ̂[e^{g(Z)}].
Then, with probability at least 1 − δ,
D_KL(µ′||µ) − D_KL(µ̂′||µ̂)
= sup_{g∈G} (E_{µ′}[g(Z′)] − log E_µ[e^{g(Z)}]) − sup_{g∈G} (E_{µ̂′}[g(Z′)] − log E_µ̂[e^{g(Z)}])
≤ sup_{g∈G} (E_{µ′}[g(Z′)] − log E_µ[e^{g(Z)}] − E_{µ̂′}[g(Z′)] + log E_µ̂[e^{g(Z)}])
= sup_{g∈G} (E_{µ′}[g(Z′)] − E_{µ̂′}[g(Z′)] + log E_µ̂[e^{g(Z)}] − log E_µ[e^{g(Z)}])
≤ sup_{g∈G} |E_{µ′}[g(Z′)] − E_{µ̂′}[g(Z′)]| + sup_{g∈G} |log E_µ̂[e^{g(Z)}] − log E_µ[e^{g(Z)}]|
≤ sup_{g∈G} |E_{µ′}[g(Z′)] − E_{µ̂′}[g(Z′)]| + sup_{g∈G} (1/α)|E_µ̂[e^{g(Z)}] − E_µ[e^{g(Z)}]|   (12)
≤ C_1(κ) √((1/n)(d_1 log(2en/d_1) + log(4/δ))) + C_2(α) √((1/m)(d_2 log(2em/d_2) + log(4/δ))),   (13)
where Eq. (12) is derived below and Eq. (13) is by Lemma B.1 and Lemma B.2. W.l.o.g. assume E_µ̂[e^{g(Z)}] ≤ E_µ[e^{g(Z)}] (Eq. (12) holds analogously in the other case); then
log E_µ[e^{g(Z)}] − log E_µ̂[e^{g(Z)}] = log(E_µ[e^{g(Z)}] / E_µ̂[e^{g(Z)}]) = log(1 + E_µ[e^{g(Z)}]/E_µ̂[e^{g(Z)}] − 1)
≤ E_µ[e^{g(Z)}]/E_µ̂[e^{g(Z)}] − 1 = (1/E_µ̂[e^{g(Z)}])(E_µ[e^{g(Z)}] − E_µ̂[e^{g(Z)}]) ≤ (1/α)(E_µ[e^{g(Z)}] − E_µ̂[e^{g(Z)}]).
This concludes the proof.

With Theorem B.1 and Theorem 4.1, we immediately have the following corollary.

Corollary B.1. Let the conditions of Theorem B.1 and Theorem 4.1 hold. Then for any w ∈ W,
Err(w) ≤ √2 R √(D_KL(µ̂′||µ̂) + C_1(κ) √((1/n)(d_1 log(2en/d_1) + log(4/δ))) + C_2(α) √((1/m)(d_2 log(2em/d_2) + log(4/δ)))),
where C_1(κ) and C_2(α) are the same as in Theorem B.1.

B.9 ADDITIONAL DISCUSSIONS ON THE CONVERGENCE OF EMPIRICAL KL DIVERGENCE

Characterizing the convergence of the empirical KL divergence to the true KL divergence is a challenging task that often requires additional assumptions, as demonstrated in Theorem B.1. However, the convergence rate of the empirical distribution to the true distribution in the KL sense is already established for discrete spaces. This fact is supported by a classic result in (Cover & Thomas, 2006, Theorem 11.2.1), which we state in the following theorem. Theorem B.2. Let µ̂ and µ̂′ be defined as in Theorem B.1. Assume the space Z is finite (i.e. |Z| < ∞). Then for any δ ∈ (0, 1), with probability at least 1 − δ,
D_KL(µ̂||µ) ≤ (|Z|/n) log(n + 1) + (1/n) log(1/δ), and D_KL(µ̂′||µ′) ≤ (|Z|/m) log(m + 1) + (1/m) log(1/δ).
Thus, it suffices to ensure that the empirical KL divergence converges to the true KL divergence at a similar rate, although we do not know whether a faster convergence rate exists.
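The discrete-space bound in Theorem B.2 can be observed empirically. The sketch below (distribution, n and δ chosen arbitrarily) samples from a four-symbol distribution and checks the bound for a single draw; the bound holds with probability at least 1 − δ, so a fixed random seed makes the check deterministic:

```python
import math
import random

random.seed(0)
mu = [0.1, 0.2, 0.3, 0.4]          # true distribution over |Z| = 4 symbols
n, delta = 2000, 0.05

# Draw n i.i.d. samples and form the empirical distribution mu_hat.
counts = [0] * len(mu)
for _ in range(n):
    u, acc = random.random(), 0.0
    for z, p in enumerate(mu):
        acc += p
        if u <= acc:
            counts[z] += 1
            break
mu_hat = [c / n for c in counts]

# D_KL(mu_hat || mu), with the 0 log 0 = 0 convention.
emp_kl = sum(q * math.log(q / p) for q, p in zip(mu_hat, mu) if q > 0)
bound = len(mu) / n * math.log(n + 1) + math.log(1 / delta) / n
assert emp_kl <= bound
```

In this regime the empirical KL is typically an order of magnitude below the bound, consistent with the remark that the stated rate need not be tight.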

B.10 GENERALIZE TO APPROXIMATE TRIANGLE INEQUALITY

In Section 4, some results require that the loss obey the triangle inequality (i.e. Assumption 4), such as Theorem 4.2 and Theorem 4.5. While the 0-1 loss satisfies Assumption 4, some other losses may not. Thus, to generalize Theorem 4.2 and Theorem 4.5, we invoke an approximate triangle inequality, originally defined in Crammer et al. (2008). Assumption 5 (α-Triangle). ℓ(·,·) is symmetric and satisfies the following α-triangle inequality: ℓ(y_1, y_2) ≤ α(ℓ(y_1, y_3) + ℓ(y_3, y_2)) for any y_1, y_2, y_3 ∈ Y, where α ≥ 1 is a constant that may depend on the hypothesis space W and the loss ℓ. Remark B.1. We note that the squared loss satisfies the 2-triangle inequality. Thus, Theorem 4.2 can easily be generalized as follows. Theorem B.3. Suppose Assumption 5 holds and ℓ(f_{w′}(X), f_w(X)) is R-subgaussian for any w, w′ ∈ W. Then for any w, Err(w) ≤ (α² − 1)R_µ(w) + α√(2R² D_KL(P_{X′}||P_X)) + α²λ*, where λ* = min_{w∈W} R_{µ′}(w) + R_µ(w). Theorem 4.5 can be generalized in a similar way. While Theorem B.3 is, strictly speaking, not a generalization bound, as it includes R_µ in the bound, it shares the same underlying idea as Theorem 4.2: to minimize the population risk in the target domain, it is essential for the source and target domains to be similar, and for both R_µ and λ* to be kept small.
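Remark B.1's claim that the squared loss is 2-triangle follows from (a + b)² ≤ 2a² + 2b²; a brute-force numeric check over random triples:

```python
import random

# Check the 2-triangle inequality for the squared loss:
# (y1 - y2)^2 <= 2 * ((y1 - y3)^2 + (y3 - y2)^2).
random.seed(1)
sq = lambda a, b: (a - b) ** 2
for _ in range(1000):
    y1, y2, y3 = (random.uniform(-10, 10) for _ in range(3))
    assert sq(y1, y2) <= 2 * (sq(y1, y3) + sq(y3, y2)) + 1e-9
```

Equality is approached when y3 is the midpoint of y1 and y2, so α = 2 cannot be improved for the squared loss.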

C OMITTED PROOFS AND ADDITIONAL DISCUSSIONS IN SECTION 5

C.1 ADDITIONAL DISCUSSION ON THEOREM 5.1

To derive the bound in Theorem 5.1, we make use of the second equality in Eq. (1). In fact, by the definition of Err (the first equality in Eq. (1)), the unlabelled sample S′_{X′_j} does not explicitly appear, so one can apply a similar information-theoretic analysis starting from the first equality in Eq. (1) and obtain an upper bound consisting of I(W; Z_i) and D_KL(µ||µ′). Precisely, the following bound holds.

Theorem C.1. Assume ℓ(f_w(X′), Y′) is R-subgaussian for any w ∈ W. Then
|Err| ≤ (1/n) Σ_{i=1}^n √(2R² I(W; Z_i)) + √(2R² D_KL(µ||µ′)).

The proof of Theorem C.1 is nearly the same as those of (Wu et al., 2020, Corollary 2) and (Masiha et al., 2021, Corollary 1). It is important to note that although I(W; Z_i) ≤ I(W; Z_i|X′_j) = E_{X′_j}[I_{X′_j}(W; Z_i)], the bound in Theorem 5.1 is incomparable to the bound based on I(W; Z_i). This is mainly because we use the disintegrated mutual information I_{X′_j}(W; Z_i), and the expectation over X′_j is outside the square root, which is a concave function. Using I_{X′_j}(W; Z_i) instead of I(W; Z_i) also reveals more about the role the unlabelled target data plays in the algorithm.

C.2 PROOF OF THEOREM 5.1

Proof. Exploiting the fact that
|Err| = |(1/n) Σ_{i=1}^n E_{W,Z_i}[ℓ(f_W(X_i), Y_i)] − E_{W,Z′}[ℓ(f_W(X′), Y′)]|
= |(1/m) Σ_{j=1}^m E_{X′_j}[(1/n) Σ_{i=1}^n E_{W,Z_i|X′_j}[ℓ(f_W(X_i), Y_i)] − E_{W,Z′|X′_j}[ℓ(f_W(X′), Y′)]]|
≤ (1/m) Σ_{j=1}^m E_{X′_j}|(1/n) Σ_{i=1}^n E_{W,Z_i|X′_j}[ℓ(f_W(X_i), Y_i)] − E_{W,Z′|X′_j}[ℓ(f_W(X′), Y′)]|
≤ (1/nm) Σ_{j=1}^m Σ_{i=1}^n E_{X′_j}|E_{W,Z_i|X′_j}[ℓ(f_W(X_i), Y_i)] − E_{W|X′_j}[R_{µ′}(W)]|,
where the last two inequalities are by Jensen's inequality applied to the absolute value.
Notice that
D_KL(P_{W,Z_i|X′_j=x′_j} || P_{W|X′_j=x′_j} P_{Z′})
= E_{P_{W,Z_i|X′_j=x′_j}}[log(P_{W,Z_i|X′_j=x′_j} / (P_{W|X′_j=x′_j} P_{Z′}))]
= E_{P_{W,Z_i|X′_j=x′_j}}[log(P_{W|Z_i,X′_j=x′_j} P_{Z_i} / (P_{W|X′_j=x′_j} P_{Z′}))]
= E_{P_{W,Z_i|X′_j=x′_j}}[log(P_{W|Z_i,X′_j=x′_j} / P_{W|X′_j=x′_j})] + E_{P_{Z_i}}[log(P_{Z_i}/P_{Z′})]
= I(W; Z_i|X′_j = x′_j) + D_KL(µ||µ′).
Recalling Eq. (15), we then have
|Err| ≤ (1/nm) Σ_{j=1}^m Σ_{i=1}^n E_{X′_j}|E_{W,Z_i|X′_j}[ℓ(f_W(X_i), Y_i)] − E_{W|X′_j}[R_{µ′}(W)]|
≤ (1/nm) Σ_{j=1}^m Σ_{i=1}^n E_{X′_j}[√(2R² D_KL(P_{W,Z_i|X′_j} || P_{W|X′_j} P_{Z′}))]
= (1/nm) Σ_{j=1}^m Σ_{i=1}^n E_{X′_j}[√(2R² (I_{X′_j}(W; Z_i) + D_KL(µ||µ′)))]
≤ (1/nm) Σ_{j=1}^m Σ_{i=1}^n E_{X′_j}[√(2R² I_{X′_j}(W; Z_i))] + √(2R² D_KL(µ||µ′)).
This completes the proof.

C.3 PROOF OF COROLLARY 5.1

Proof. We now modify the proof of Theorem 5.1.

Recall that

|Err| ≤ (1/nm) Σ_{j=1}^m Σ_{i=1}^n E_{X′_j}|E_{W,Z_i|X′_j}[ℓ(f_W(X_i), Y_i)] − E_{W|X′_j}[R_{µ′}(W)]|.
We first decompose the right-hand side:
E_{W,Z_i|X′_j=x′_j}[ℓ(f_W(X_i), Y_i)] − E_{W|X′_j=x′_j}[R_{µ′}(W)]
= E_{W,Z_i|X′_j=x′_j}[ℓ(f_W(X_i), Y_i)] − E_{W|X′_j=x′_j}[R_µ(W)] + E_{W|X′_j=x′_j}[R_µ(W)] − E_{W|X′_j=x′_j}[R_{µ′}(W)]
≤ |E_{W,Z_i|X′_j=x′_j}[ℓ(f_W(X_i), Y_i)] − E_{W|X′_j=x′_j}[R_µ(W)]| + |E_{W|X′_j=x′_j}[R_µ(W) − R_{µ′}(W)]|
≤ |E_{W,Z_i|X′_j=x′_j}[ℓ(f_W(X_i), Y_i)] − E_{W|X′_j=x′_j}[R_µ(W)]| + (M/√2)√(min{D_KL(µ||µ′), D_KL(µ′||µ)}),
where the last inequality is by Corollary 4.2. For the first term, with Z an independent copy drawn from µ, notice that
D_KL(P_{W,Z|X′_j=x′_j} || P_{W,Z_i|X′_j=x′_j}) = D_KL(P_{W|X′_j=x′_j} P_Z || P_{W,Z_i|X′_j=x′_j})
≥ sup_t E_{P_{W|X′_j=x′_j} P_Z}[t ℓ(f_W(X), Y)] − log E_{P_{W,Z_i|X′_j=x′_j}}[exp t ℓ(f_W(X_i), Y_i)]
≥ sup_t E_{P_{W|X′_j=x′_j} P_Z}[t ℓ(f_W(X), Y)] − E_{P_{W,Z_i|X′_j=x′_j}}[t ℓ(f_W(X_i), Y_i)] − log E_{P_{W,Z_i|X′_j=x′_j}}[e^{t(ℓ(f_W(X_i),Y_i) − E_{P_{W,Z_i|X′_j=x′_j}}[ℓ(f_W(X_i),Y_i)])}]
≥ sup_t E_{P_{W|X′_j=x′_j}}[t R_µ(W)] − E_{P_{W,Z_i|X′_j=x′_j}}[t ℓ(f_W(X_i), Y_i)] − M²t²/8,
where the last inequality is because ℓ is bounded by M and hence ℓ(f_W(X_i), Y_i) is M/2-subgaussian. Thus,
E_{W,Z_i|X′_j=x′_j}[ℓ(f_W(X_i), Y_i)] − E_{W|X′_j=x′_j}[R_µ(W)] ≤ √((M²/2) D_KL(P_{W|X′_j=x′_j} P_Z || P_{W,Z_i|X′_j=x′_j})) = √((M²/2) L(W; Z_i|X′_j = x′_j)).
Plugging this inequality, together with the decomposition, into the inequality at the beginning of the proof, we have
|Err| ≤ (1/nm) Σ_{j=1}^m Σ_{i=1}^n E_{X′_j}[√((M²/2) L_{X′_j}(W; Z_i))] + (M/√2)√(min{D_KL(µ||µ′), D_KL(µ′||µ)}).
A similar development also holds for D_KL(P_{W,Z_i|X′_j=x′_j} || P_{W|X′_j=x′_j} P_Z), as in the proof of Theorem 5.1; thus
|Err| ≤ (M/(√2 nm)) Σ_{j=1}^m Σ_{i=1}^n E_{X′_j}[√(min{I_{X′_j}(W; Z_i), L_{X′_j}(W; Z_i)})] + (M/√2)√(min{D_KL(µ||µ′), D_KL(µ′||µ)}).
This completes the proof.

C.4 PROOF OF THEOREM 5.2

Proof.
Similar to the proof of Corollary 5.1, and recalling Theorem 4.4,
|Err| ≤ (1/nm) Σ_{j=1}^m Σ_{i=1}^n E_{X′_j}[|E_{W,Z_i|X′_j}[ℓ(f_W(X_i), Y_i)] − E_{W|X′_j}[R_µ(W)]| + |E_{W|X′_j}[R_µ(W)] − E_{W|X′_j}[R_{µ′}(W)]|]
≤ (1/nm) Σ_{j=1}^m Σ_{i=1}^n E_{X′_j}|E_{W,Z_i|X′_j}[ℓ(f_W(X_i), Y_i)] − E_{W|X′_j}[R_µ(W)]| + β W(µ, µ′)
≤ (1/nm) Σ_{j=1}^m Σ_{i=1}^n E_{X′_j,Z_i}|E_{W|Z_i,X′_j}[ℓ(f_W(X_i), Y_i)] − E_{W|X′_j}[ℓ(f_W(X_i), Y_i)]| + β W(µ, µ′)
≤ (β′/nm) Σ_{j=1}^m Σ_{i=1}^n E_{X′_j,Z_i}[W(P_{W|X′_j,Z_i}, P_{W|X′_j})] + β W(µ, µ′),
where the last inequality is by Lemma A.3. This concludes the proof.

C.5 PROOF OF COROLLARY 5.2

Proof. Similar to the proof of Corollary 4.3, replacing the Wasserstein distance by the total variation and replacing β and β′ by M gives the first inequality:
Err ≤ (M/nm) Σ_{j=1}^m Σ_{i=1}^n E_{X′_j,Z_i}[TV(P_{W|Z_i,X′_j}, P_{W|X′_j})] + M TV(µ, µ′).
The second inequality is by Lemma A.4:
Err ≤ (M/nm) Σ_{j=1}^m Σ_{i=1}^n E_{X′_j,Z_i}[√(D_KL(P_{W|Z_i,X′_j} || P_{W|X′_j})/2)] + (M/√2)√(D_KL(µ||µ′)).
Again, one can also apply Lemma A.5 here. This concludes the proof.

C.6 PROOF OF THEOREM 5.3

Proof. Recall Theorem 5.1; by Jensen's inequality we have
|Err| ≤ (1/nm) Σ_{j=1}^m Σ_{i=1}^n E_{X′_j}[√(2R² I_{X′_j}(W; Z_i))] + √(2R² D_KL(µ||µ′))
≤ √((2R²/nm) Σ_{j=1}^m Σ_{i=1}^n I(W; Z_i|X′_j)) + √(2R² D_KL(µ||µ′)).
Let X′_{1,...,j−1,j+1,...,m} = S′_{X′} \ X′_j. Since Z_i is independent of S′_{X′}, I(X′_{1,...,j−1,j+1,...,m}; Z_i|X′_j) = 0, and thus
I(W; Z_i|S′_{X′}) = I(W; Z_i|S′_{X′}) + I(X′_{1,...,j−1,j+1,...,m}; Z_i|X′_j)
= I(W, X′_{1,...,j−1,j+1,...,m}; Z_i|X′_j)
= I(W; Z_i|X′_j) + I(X′_{1,...,j−1,j+1,...,m}; Z_i|X′_j, W)
≥ I(W; Z_i|X′_j).
Hence I(W; Z_i|X′_j) ≤ I(W; Z_i|S′_{X′}), and
(1/nm) Σ_{j=1}^m Σ_{i=1}^n I(W; Z_i|X′_j) ≤ (1/nm) Σ_{j=1}^m Σ_{i=1}^n I(W; Z_i|S′_{X′}) = (1/n) Σ_{i=1}^n I(W; Z_i|S′_{X′}).
Then, since S ⊥⊥ S′_{X′} and Z_i ⊥⊥ Z_{1:i−1} for any i ∈ [n], by the chain rule of mutual information we have
I(W; S|S′_{X′}) = Σ_{i=1}^n I(W; Z_i|S′_{X′}, Z_{1:i−1})
= Σ_{i=1}^n I(W; Z_i|S′_{X′}, Z_{1:i−1}) + I(Z_i; Z_{1:i−1}|S′_{X′})
= Σ_{i=1}^n I(W, Z_{1:i−1}; Z_i|S′_{X′})
= Σ_{i=1}^n I(W; Z_i|S′_{X′}) + I(Z_i; Z_{1:i−1}|S′_{X′}, W)
≥ Σ_{i=1}^n I(W; Z_i|S′_{X′}).
Thus, the generalization error bound becomes
|Err| ≤ √((2R²/n) I(W; S|S′_{X′})) + √(2R² D_KL(µ||µ′)).
Recall the updating rule of W and note that W_0 is independent of S and S′_{X′}. Using the chain rule of mutual information and the data-processing inequality recursively,
I(W_T; S|S′_{X′}) = I(W_{T−1} − η_T g(W_{T−1}, Z_{B_T}, X′_{B_T}) + N_T; S|S′_{X′})
≤ I(W_{T−1}, −η_T g(W_{T−1}, Z_{B_T}, X′_{B_T}) + N_T; S|S′_{X′})
= I(W_{T−1}; S|S′_{X′}) + I(−η_T g(W_{T−1}, Z_{B_T}, X′_{B_T}) + N_T; S|S′_{X′}, W_{T−1})
⋮
= Σ_{t=1}^T I(−η_t g(W_{t−1}, Z_{B_t}, X′_{B_t}) + N_t; S|S′_{X′}, W_{t−1}).
For each t ∈ [T], denote g(W_{t−1}, Z_{B_t}, X′_{B_t}) by G_t. Then
I(−η_t G_t + N_t; S|S′_{X′}, W_{t−1}) = E_{S′_{X′},W_{t−1},S}[D_KL(P_{G_t + N_t/η_t |S,S′_{X′},W_{t−1}} || P_{G_t + N_t/η_t |S′_{X′},W_{t−1}})]
≤ E_{S′_{X′},W_{t−1},S}[D_KL(P_{G_t + N_t/η_t |S,S′_{X′},W_{t−1}} || P_{E_S[G_t] + N_t/η_t |S′_{X′},W_{t−1}})]
= (η_t²/2σ_t²) E_{S′_{X′},W_{t−1},S}[||G_t − E_S[G_t]||²],
where the inequality is by Lemma A.6 and the last equality is the KL divergence between two Gaussian distributions. Finally, putting everything together,
|Err| ≤ √((R²/n) Σ_{t=1}^T (η_t²/σ_t²) E_{S′_{X′},W_{t−1},S}[||G_t − E_S[G_t]||²]) + √(2R² D_KL(µ||µ′)),
which concludes the proof.

C.7 DERIVATION OF EQ. (6)

Recall the expected cross-entropy loss; we have E_{W,Z_i}[ℓ(f_W(T_i), Y_i)] = E_{Z_i,W}[−log Q_{Y_i|T_i,W}].

C.8 ADDITIONAL DISCUSSION ON LIMIT

In Section 5, we discussed the LIMIT approach proposed by Harutyunyan et al. (2020) as a means of controlling label-information memorization during training. Roughly speaking, to update the classifier parameters, LIMIT constructs an auxiliary network that predicts gradients instead of using the true gradients, which avoids direct use of the true labels for training. To obtain accurate gradients, however, the auxiliary network itself needs to be trained with the true labels. We found the training of LIMIT to be unstable and its hyperparameters difficult to tune under UDA settings. We therefore opted for the pseudo-label strategy proposed in Section 5 instead of the pseudo-gradient strategy.

D EXPERIMENT DETAILS

We implemented our approach using PyTorch (Paszke et al., 2019). Here, L(W, Z_{B_t}, X′_{B_t}) is a loss function for the source and target domain data in the current mini-batch, and λ_1 is the trade-off coefficient. For example, if we combine ERM with the gradient penalty then L(W, Z_{B_t}, X′_{B_t}) = (1/|B_t|) Σ_{k∈B_t} ℓ(f_W(X_k), Y_k), where ℓ could be the cross-entropy loss. Moreover, if we combine the KL guided marginal alignment algorithm (Nguyen et al., 2022) with the gradient penalty, then the objective function is
min_{W,θ} (1/|B_t|) Σ_{k∈B_t} ℓ(f_W(T_k), Y_k) + β_1 D_KL(P_{T′}||P_T) + β_2 D_KL(P_T||P_{T′}) + λ_1 ||g(W, Z_{B_t}, X′_{B_t})||²,
where θ is the parameters of the representation network and the gradient is
g(W, Z_{B_t}, X′_{B_t}) = (1/|B_t|) Σ_{k∈B_t} ∇_{W,θ} ℓ(f_W(T_k), Y_k) + β_1 ∇_θ D_KL(P_{T′}||P_T) + β_2 ∇_θ D_KL(P_T||P_{T′}).
In Nguyen et al. (2022), the representation distribution is modelled as a Gaussian distribution, and the empirical KL divergence is estimated from the mini-batch data, as given in Nguyen et al. (2022). When we train the model with controlling label information, the objective function becomes
min_W L(W, Z_{B_t}, X′_{B_t}) + λ_2 ||W − W̃||²,
where W̃ is the auxiliary classifier and λ_2 is the trade-off hyperparameter. Similarly, when we combine the KL guided marginal alignment algorithm with controlling label information, the objective function in every iteration is
min_{W,θ} (1/|B_t|) Σ_{k∈B_t} ℓ(f_W(T_k), Y_k) + β_1 D_KL(P_{T′}||P_T) + β_2 D_KL(P_T||P_{T′}) + λ_2 ||W − W̃||².
In addition, the training objective for the auxiliary classifier is
min_{W̃} (1/|B_t|) Σ_{k∈B_t} ℓ(f_{W̃}(T′_k), f_W(T′_k)) + (1/|B_t|) Σ_{k∈B_t} ℓ(f_{W̃}(T_k), f_W(T_k)).   (16)
In practice, removing the second term would not affect the performance. Note that we need to disable the automatic differentiation of T, T′ and W when executing the backward pass for the auxiliary classifier. The detailed algorithm of controlling label information is given in the next section.

D.2 ALGORITHM OF CONTROLLING LABEL INFORMATION AND ADDITIONAL RESULTS OF ERM-CL

If we only provide pseudo-labels for the target-domain data to the auxiliary classifier, i.e. remove the second term in Eq. (16), then Algorithm 1 combines any marginal alignment algorithm with controlling label information. Even without a marginal alignment component, e.g. for ERM, in which case the alignment loss L_r is removed, Algorithm 1 still boosts performance in practice. Table 2 shows that ERM-CL can overall outperform basic ERM and is close to the performance of ERM-GP.
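To make the controlling-label-information recipe concrete, here is a minimal NumPy sketch, not the paper's implementation: a linear-softmax stand-in for both classifiers, with illustrative dimensions, learning rate and λ_2. The auxiliary classifier W_aux never sees the true source labels; it is fit on pseudo-labels produced by the main classifier W, and ||W − W_aux||² regularizes the update of W:

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, n = 5, 3, 64                     # feature dim, classes, batch size (assumed)

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def grad_ce(W, T, Y_onehot):
    # Gradient of mean cross-entropy for a linear-softmax classifier.
    return T.T @ (softmax(T @ W) - Y_onehot) / len(T)

T_src = rng.normal(size=(n, d))        # source representations
Y_src = np.eye(k)[rng.integers(k, size=n)]
T_tgt = rng.normal(size=(n, d))        # unlabelled target representations

W = np.zeros((d, k))                   # main classifier
W_aux = np.zeros((d, k))               # auxiliary classifier (no true labels)
lr, lam2 = 0.5, 0.1                    # step size and CL trade-off (assumed)

for _ in range(50):
    # 1. Auxiliary classifier: fit on pseudo-labels assigned by f_W.
    pseudo = np.eye(k)[(T_tgt @ W).argmax(axis=1)]
    W_aux -= lr * grad_ce(W_aux, T_tgt, pseudo)
    # 2. Main classifier: task loss plus the ||W - W_aux||^2 regularizer.
    W -= lr * (grad_ce(W, T_src, Y_src) + 2 * lam2 * (W - W_aux))

reg = float(np.sum((W - W_aux) ** 2))  # the penalized quantity
```

In a PyTorch implementation, gradients through the pseudo-labels and through W in step 1 would be detached, matching the note above about disabling automatic differentiation for the auxiliary backward pass.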



Footnotes:
1. A random variable X is R-subgaussian if for any ρ ∈ R, log E[exp(ρ(X − EX))] ≤ ρ²R²/2.
2. Some losses that only satisfy a more general version of Assumption 4 are discussed in Appendix B.10.
3. Available at: https://github.com/facebookresearch/DomainBed
4. Available at: https://github.com/atuannguyen/kl



B Omitted Proofs and Additional Results in Section 4
B.1 Proof of Theorem 4.1
B.2 Proof of Corollary 4.2
B.3 Proof of Theorem 4.2
B.4 Proof of Theorem 4.3
B.5 Proof of Theorem 4.4
B.6 Proof of Corollary 4.3
B.7 Proof of Theorem 4.5
B.8 Additional Results: Sample Complexity Bounds
B.9 Additional Discussions on the Convergence of Empirical KL Divergence
B.10 Generalize to Approximate Triangle Inequality
C Omitted Proofs and Additional Discussions in Section 5
C.1 Additional Discussion on Theorem 5.1
C.2 Proof of Theorem 5.1
C.3 Proof of Corollary 5.1
C.4 Proof of Theorem 5.2
C.5 Proof of Corollary 5.2
C.6 Proof of Theorem 5.3
C.7 Derivation of Eq. (6)
C.8 Additional Discussion on LIMIT
D Experiment Details
D.1 Objective Functions of Gradient Penalty and Controlling Label Information
D.2 Algorithm of Controlling Label Information and Additional Results of ERM-CL
D.3 Architectures and Hyperparameters
D.4 Additional Experimental Results
D.5 Ablation Study on the Effect of Gradient Penalty Hyperparameter
D.6 Visualization Results
D.7 Results on VisDA17
D.8 Dynamics of Jeffrey's divergence


All experiments were conducted on NVIDIA Tesla V100 GPUs with 32 GB of memory. Our code builds largely on the implementations of Gulrajani & Lopez-Paz (2021) and Nguyen et al. (2022).

D.1 OBJECTIVE FUNCTIONS OF GRADIENT PENALTY AND CONTROLLING LABEL INFORMATION

For every iteration, the objective function after adding the gradient penalty becomes
min_W L(W, Z_{B_t}, X′_{B_t}) + λ_1 ||g(W, Z_{B_t}, X′_{B_t})||².
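The gradient-penalty objective can be sketched in closed form for a linear least-squares model, where the mini-batch gradient g is affine in W and the penalty's own gradient is analytic. This is an illustrative NumPy stand-in, not the paper's network; in the actual experiments the penalty of the training loss gradient is handled by autodiff:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 128, 4
X = rng.normal(size=(n, d))
y = X @ rng.normal(size=d) + 0.1 * rng.normal(size=n)

lam1, lr = 0.1, 0.05                    # trade-off and step size (assumed)
W = np.zeros(d)
H = 2 * X.T @ X / n                     # Hessian of L(W); g is affine in W
for _ in range(300):
    g = 2 * X.T @ (X @ W - y) / n       # mini-batch gradient of L(W)
    # d/dW [ L(W) + lam1 * ||g||^2 ] = g + 2 * lam1 * H @ g
    W -= lr * (g + 2 * lam1 * H @ g)

final_loss = float(np.mean((X @ W - y) ** 2))
penalty = float(np.sum((2 * X.T @ (X @ W - y) / n) ** 2))
```

Because ||g||² vanishes exactly at the minimizer of L here, the penalty does not move the optimum; in the non-convex network setting it instead biases training toward low-gradient-variance regions, which is the behavior the bound in Theorem 5.3 rewards.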

That is, T ∼ N(µ_θ, σ_θ² I_d | X) and T′ ∼ N(µ_θ, σ_θ² I_d | X′). Additionally, let the batch size be b = |B_t|; the empirical KL terms are then estimated from the b mini-batch samples.

The terms β_1 D_KL(P_{T′}||P_T) + β_2 D_KL(P_T||P_{T′}) are computed with P_{T_k|X_k} = N(µ_θ, σ_θ² I_d | X_k) and P_{T′_k|X′_k} = N(µ_θ, σ_θ² I_d | X′_k). To be more precise, µ_θ and σ_θ are the outputs of the representation network. Since the forward pass requires sampling T_k and T′_k, we need to use the reparameterization trick (Kingma & Welling, 2013) for the backward pass.
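The two ingredients just described, the closed-form KL between diagonal Gaussians and the reparameterized sample, can be sketched as follows (dimensions and parameter values are illustrative, not the paper's settings):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8
mu1, sig1 = rng.normal(size=d), np.full(d, 0.5)    # N(mu1, diag(sig1^2))
mu2, sig2 = rng.normal(size=d), np.full(d, 1.0)    # N(mu2, diag(sig2^2))

def kl_diag_gauss(mu_p, sig_p, mu_q, sig_q):
    # D_KL( N(mu_p, diag(sig_p^2)) || N(mu_q, diag(sig_q^2)) ), closed form
    return float(np.sum(
        np.log(sig_q / sig_p)
        + (sig_p ** 2 + (mu_p - mu_q) ** 2) / (2 * sig_q ** 2)
        - 0.5))

kl = kl_diag_gauss(mu1, sig1, mu2, sig2)
assert kl >= 0 and abs(kl_diag_gauss(mu1, sig1, mu1, sig1)) < 1e-12

# Reparameterization trick: T = mu + sig * eps keeps sampling differentiable
# with respect to (mu, sig), since randomness enters only through eps.
eps = rng.normal(size=(10000, d))
T = mu1 + sig1 * eps
assert np.allclose(T.mean(axis=0), mu1, atol=0.05)
```

In the training code, mu and sig would be the outputs of the representation network, and the KL term above would be averaged over the mini-batch.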

ℓ(f_w(X), Y) is β-Lipschitz continuous in Z with respect to a metric d on Z for any w ∈ W, i.e., |ℓ(f_w(x_1), y_1) − ℓ(f_w(x_2), y_2)| ≤ β d(z_1, z_2) for some metric d on Z. If Assumption 2 holds, then for any w ∈ W, Err(w) ≤ √(2R² D_KL(µ′||µ)).

Theorem 4.1 and Nguyen et al. (2022) both use the KL divergence from the source domain to the target domain, D_KL(µ′||µ); in fact, Err(w) can also be upper bounded by D_KL(µ||µ′). This can be done by invoking the subgaussianity of ℓ(f_w(X′), Y′) (rather than ℓ(f_w(X), Y)); for a bounded loss, the subgaussianity of ℓ(f_w(X′), Y′) is also satisfied. We then obtain the following corollary, whose second bound involves D_KL(µ||µ′) + D_KL(µ′||µ). Remark 4.3. In the second inequality of Corollary 4.2, D_KL(µ||µ′) + D_KL(µ′||µ) is known as the symmetrized KL divergence, or Jeffrey's divergence.

Table 1: RotatedMNIST and Digits. Results of baselines are reported from Nguyen et al. (2022).

Datasets We select two popular small datasets, RotatedMNIST and Digits, to compare the different methods. RotatedMNIST is built from the MNIST dataset (LeCun et al., 2010) and consists of six domains, each containing 11,666 images. These six domains are rotated MNIST images with rotation angles 0°, 15°, 30°, 45°, 60° and 75°, respectively.

E_{Z_i,W}[−log Q_{Y_i|T_i,W}]
= H(Y_i|T_i, W) + E_{T_i,W}[D_KL(P_{Y_i|T_i,W} || Q_{Y_i|T_i,W})]
= E_{Z_i,W}[log (P_{Y_i|T_i} P_{W|T_i}) / P_{Y_i,W|T_i} − log P_{Y_i|T_i}] + E_{T_i,W}[D_KL(P_{Y_i|T_i,W} || Q_{Y_i|T_i,W})]
= H(Y_i|T_i) − I(W; Y_i|T_i) + E_{T_i,W}[D_KL(P_{Y_i|T_i,W} || Q_{Y_i|T_i,W})]

C.8 ADDITIONAL DISCUSSION ON LIMITATIONS
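The key identity used in the derivation above, H(Y_i|T_i, W) = H(Y_i|T_i) − I(W; Y_i|T_i), can be verified numerically on a toy discrete joint distribution; the snippet below is only an illustration of the chain rule, with an arbitrary random pmf standing in for the actual variables:

```python
import numpy as np

rng = np.random.default_rng(0)
p = rng.random((2, 3, 2))   # hypothetical joint pmf p[y, t, w]
p /= p.sum()

p_tw = p.sum(axis=0)        # p(t, w)
p_yt = p.sum(axis=2)        # p(y, t)
p_t = p.sum(axis=(0, 2))    # p(t)

# H(Y | T, W) = -sum_{y,t,w} p(y,t,w) log p(y|t,w)
h_y_tw = -np.sum(p * np.log(p / p_tw[None, :, :]))
# H(Y | T) = -sum_{y,t} p(y,t) log p(y|t)
h_y_t = -np.sum(p_yt * np.log(p_yt / p_t[None, :]))
# I(W; Y | T) = sum p(y,t,w) log [ p(y,t,w) p(t) / (p(y,t) p(t,w)) ]
i_w_y_t = np.sum(p * np.log(p * p_t[None, :, None]
                            / (p_yt[:, :, None] * p_tw[None, :, :])))

# The identity H(Y|T,W) = H(Y|T) - I(W;Y|T):
assert np.isclose(h_y_tw, h_y_t - i_w_y_t)
```

The same decomposition holds in nats or bits as long as the logarithm base is used consistently.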

RotatedMNIST and Digits experiments of ERM-CL. Results of ERM are reported from Nguyen et al. (2022).

ACKNOWLEDGMENTS

This work is supported in part by an NSERC Discovery grant and a National Research Council of Canada (NRC) Collaborative R&D grant (AI4D-CORE-07). Ziqiao Wang is also supported in part by the NSERC CREATE program through the Interdisciplinary Math and Artificial Intelligence (INTER-MATH-AI) project. The authors would like to thank the anonymous reviewers for their careful reading and valuable suggestions.

Appendix Table of Contents

A Some Prerequisite Definitions and Useful Lemmas

I(W; Z_i) allows us to see in more detail the role of the unlabelled target data in the algorithm. Additionally, one can also prove a bound based on I(W; Z_i|X'_j) (e.g., by simply applying Jensen's inequality to Theorem 5.1), which is close to an individual-sample, UDA version of (Bu et al., 2022, Theorem 3). Furthermore, the first term in Theorem 5.1 characterizes the expected generalization gap on the source domain (i.e., E_{W,S}[R_μ(W) − R_S(W)]), so the bound suggests that it is possible to invoke the unlabelled target data to further improve performance on the source domain; the simplest case is semi-supervised learning (when μ = μ').

Comparison with (Wu et al., 2020; Jose & Simeone, 2021b). Notably, the bounds in (Wu et al., 2020; Jose & Simeone, 2021b) fail to characterize the dependence between W and S'_X'. More precisely, the algorithm-dependent term in their bounds is I(W; Z_i) or I(W; S), while our algorithm-dependent term is I_{X'_j}(W; Z_i), which directly depends on the unlabelled target data. Moreover, while the disintegrated mutual information I_{X'_j}(W; Z_i) and the unconditional mutual information I(W; Z_i) cannot be directly compared, recent work by Wang & Mao (2023) provides empirical evidence comparing similar terms in the supervised learning setting. Specifically, they demonstrate that when the empirical risk is small, such as in a realizable case, the disintegrated mutual information is smaller than the unconditional mutual information; conversely, when the empirical risk is large, the unconditional mutual information is the smaller of the two.

More Discussion on the Vanishing of I(X'_j; Z_i|W) in Remark 5.1. Note that S depends on S'_X' given W, so intuitively the dependence between each individual instance Z_i and X'_j is weaker when n and m become larger.
More precisely, w.l.o.g. let i = j = 1, and recall that W = A(S'_X', S). When n, m → ∞, taking S and S'_X' as the input of the algorithm is nearly equivalent to computing W based on the source distribution μ and the target distribution P_X'. Thus, W will only depend on the two distributions, without depending on the realizations Z_1 and X'_1 drawn respectively from the two distributions, that is, I(Z_1; X'_1|W) → 0. In addition, one may ask what happens if W is a constant that does not depend on the input data at all; in this case, I(Z_1; X'_1|W) = I(Z_1; X'_1) = 0 holds trivially. In the other extreme, if n = 1 and m = 1, then W = A(X'_1, Z_1), and the quantity I(Z_1; X'_1|A(X'_1, Z_1)) should be large. When n and m increase, it becomes I(Z_1; X'_1|A(X'_{1:m}, Z_{1:n})). If we now want to guess Z_1 from X'_1, this should be easier with the knowledge of A(X'_1, Z_1) than with the knowledge of A(X'_{1:m}, Z_{1:n}).

C.2 PROOF OF THEOREM 5.1

Proof. By Lemma A.1, where Eq. (14) is by the independence between the algorithm output W and the unseen target-domain data Z', and the last inequality is by the subgaussian assumption. Thus, the stated bound follows.

Algorithm 1 Controlling Label Information
Require: source-domain labelled dataset S, target-domain unlabelled dataset S'_X', batch size b, classification loss function ℓ_c, marginal alignment loss function ℓ_r, initial classifier parameter w_0 = w̄_0, initial representation network parameter θ_0, learning rate η, Lagrange multiplier λ_2
while w_t, θ_t not converged do
  Update iteration
  Compute distance from the auxiliary classifier: dis ← ||w_t − w̄_t||²
  Compute marginal alignment loss
  Compute gradient
  Obtain the pseudo labels
  Compute auxiliary classifier gradient: g_B ← ∇L_a
  Update auxiliary classifier parameter: w̄_{t+1} ← w̄_t − η · g_B
end while

Other settings are also the same as in Gulrajani & Lopez-Paz (2021) and Nguyen et al. (2022); for example, each algorithm is trained for 100 epochs.
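Since several steps of Algorithm 1 survive only partially here, the following NumPy sketch shows one plausible reading of the loop: the main classifier is trained on the labelled source data while being regularized (via λ_2) toward an auxiliary classifier fitted to pseudo-labelled target data. The toy data, the linear model, and all function names are hypothetical, not the paper's actual implementation:

```python
import numpy as np

def softmax_ce_grad(W, X, y):
    """Gradient of the mean softmax cross-entropy w.r.t. W (num_classes x num_features)."""
    logits = X @ W.T
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    probs = np.exp(logits)
    probs /= probs.sum(axis=1, keepdims=True)
    onehot = np.eye(W.shape[0])[y]
    return (probs - onehot).T @ X / len(X)

rng = np.random.default_rng(0)
Xs = rng.standard_normal((64, 5)); ys = rng.integers(0, 3, 64)  # labelled source batch
Xt = rng.standard_normal((64, 5))                               # unlabelled target batch

W = np.zeros((3, 5))       # main classifier w_t
W_aux = np.zeros((3, 5))   # auxiliary classifier \bar{w}_t
eta, lam2 = 0.1, 0.01      # learning rate and Lagrange multiplier

for _ in range(50):
    # Main update: source classification gradient plus the gradient of
    # lambda_2 * ||w_t - \bar{w}_t||^2, which keeps the two classifiers close.
    g = softmax_ce_grad(W, Xs, ys) + 2.0 * lam2 * (W - W_aux)
    W -= eta * g
    # Pseudo-label the unlabelled target batch with the current main classifier.
    y_pseudo = (Xt @ W.T).argmax(axis=1)
    # Auxiliary update: fit the pseudo-labelled target data.
    W_aux -= eta * softmax_ce_grad(W_aux, Xt, y_pseudo)

dis = np.sum((W - W_aux) ** 2)   # the distance term monitored in the loop
```

The representation-network update and the marginal alignment loss ℓ_r are omitted in this sketch, since those steps are not fully recoverable from the text.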
To select the hyperparameters (λ_1 and λ_2) for ERM-GP, ERM-KL, KL-GP, and KL-CL, we perform random search. Specifically, λ_1 is searched in [0.1, 0.9] and λ_2 is searched in [10⁻⁶, 0.8]. The search ranges for the other hyperparameters can be found in the source code of Nguyen et al. (2022).
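A sampler for this random search can be sketched as follows; sampling λ_2 log-uniformly is our assumption here (reasonable since its range spans several orders of magnitude), not something the text specifies:

```python
import numpy as np

def sample_hyperparams(rng):
    """Draw one (lambda_1, lambda_2) candidate for random search."""
    lam1 = rng.uniform(0.1, 0.9)                               # lambda_1 in [0.1, 0.9]
    lam2 = 10.0 ** rng.uniform(np.log10(1e-6), np.log10(0.8))  # lambda_2 in [1e-6, 0.8]
    return lam1, lam2

rng = np.random.default_rng(0)
candidates = [sample_hyperparams(rng) for _ in range(20)]
```

Each candidate would then be scored on a validation criterion and the best configuration retained.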

D.5 ABLATION STUDY ON THE EFFECT OF GRADIENT PENALTY HYPERPARAMETER

Our study includes an ablation analysis to investigate the impact of the hyperparameter λ_1 in the context of KL-GP. Specifically, we conduct experiments on both the RotatedMNIST and Digits datasets, where the source and target domains are set to 0°/60° and SVHN/MNIST, respectively. Table 3 summarizes the results. It is worth noting that setting λ_1 to zero effectively reduces KL-GP to KL, and our results confirm the efficacy of including the gradient penalty term in KL-GP.

D.6 VISUALIZATION RESULTS

To visualize the representations of models trained using KL, KL-GP, and KL-CL, we employ t-SNE (Van der Maaten & Hinton, 2008). It is essential to note that these regularization terms are primarily designed to enhance the performance of the classifier network, rather than the representation network.

D.7 RESULTS ON VISDA17

We also conduct experiments on the VisDA17 dataset (Peng et al., 2017), which is a real-world classification task with 280K images from 12 classes. In particular, the source domain contains synthetic images and the target domain contains real images. The representation-space version of Corollary 4.2 suggests that a small Jeffrey's divergence can lead to a low testing error. Figure 3a demonstrates that the dynamics of Jeffrey's divergence, as computed in the representation space, can effectively characterize the evolution of the testing error throughout the training phase. Additionally, Figure 3b reveals that the amount of target data used has an impact on testing performance. Specifically, when less than half of the available unlabelled target data is used, performance increases with the amount of data; when more than half is used, there is only marginal improvement in performance.
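One simple way to track Jeffrey's divergence in the representation space over training, as in Figure 3a, is to fit a diagonal Gaussian to each batch of representations and evaluate the closed-form symmetrized KL between the two fits. This is only a sketch under a Gaussian assumption; the paper does not state that the figure was produced this way, and the representation batches below are synthetic:

```python
import numpy as np

def jeffreys_divergence(reps_src, reps_tgt, eps=1e-6):
    """Symmetrized KL between diagonal-Gaussian fits of two representation batches."""
    mu_p, var_p = reps_src.mean(axis=0), reps_src.var(axis=0) + eps
    mu_q, var_q = reps_tgt.mean(axis=0), reps_tgt.var(axis=0) + eps

    def kl(mu_a, var_a, mu_b, var_b):
        # Closed-form KL between diagonal Gaussians, summed over dimensions.
        return 0.5 * np.sum(np.log(var_b / var_a)
                            + (var_a + (mu_a - mu_b) ** 2) / var_b - 1.0)

    return kl(mu_p, var_p, mu_q, var_q) + kl(mu_q, var_q, mu_p, var_p)

rng = np.random.default_rng(0)
z_src = rng.standard_normal((256, 16))          # hypothetical source representations
z_tgt = rng.standard_normal((256, 16)) + 0.5    # hypothetical shifted target representations
j = jeffreys_divergence(z_src, z_tgt)
```

Logging this quantity once per epoch gives a cheap proxy for how well the two domains are aligned in the representation space.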

