INFORMATION-THEORETIC ANALYSIS OF UNSUPERVISED DOMAIN ADAPTATION

Abstract

This paper uses information-theoretic tools to analyze the generalization error in unsupervised domain adaptation (UDA). We present novel upper bounds for two notions of generalization error. The first notion measures the gap between the population risk in the target domain and that in the source domain, and the second measures the gap between the population risk in the target domain and the empirical risk in the source domain. While our bounds for the first kind of error are in line with the traditional analysis and give similar insights, our bounds on the second kind of error are algorithm-dependent and thereby also provide insight into algorithm design. Specifically, we present two simple techniques for improving generalization in UDA and validate them experimentally.

1. INTRODUCTION

This paper focuses on the unsupervised domain adaptation (UDA) task, where the learner is confronted with a source domain and a target domain, and the algorithm is allowed access to a labeled training sample from the source domain and an unlabeled training sample from the target domain. The goal is to find a predictor that performs well on the target domain. A main obstacle in this task is the discrepancy between the two domains. Recent works (Ben-David et al., 2006; 2010; Mansour et al., 2009; Zhao et al., 2019; Zhang et al., 2019; Shen et al., 2018; Germain et al., 2020; Acuna et al., 2021; Nguyen et al., 2022) have proposed various measures to quantify this discrepancy, either for the UDA setting or for more general domain generalization tasks, and many learning algorithms have been built on them. For example, Nguyen et al. (2022) use a (reverse) KL divergence to measure the misalignment of the two domain distributions, and, motivated by their generalization bound, design an algorithm that penalizes the KL divergence between the marginal distributions of the two domains in the representation space. Although this "KL guided domain adaptation" algorithm has been demonstrated to outperform many existing marginal alignment algorithms (Ganin et al., 2016; Sun & Saenko, 2016; Shen et al., 2018; Li et al., 2018), it is not clear whether KL-based alignment of marginal distributions is adequate for UDA, and, more fundamentally, what role the unlabeled target-domain sample should play in cross-domain generalization. Notably, most UDA algorithms are heuristically designed and only intuitively justified, and most existing generalization bounds are algorithm-independent. There thus appears to be significant room for both deeper theoretical understanding and more principled algorithm design.
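To make the flavor of such marginal-alignment objectives concrete, the following sketch (our own simplification for illustration, not the code of Nguyen et al. (2022)) estimates a KL penalty between the source and target representation distributions under a Gaussian approximation of each domain; in practice this penalty would be added to a source classification loss.

```python
import numpy as np

def gaussian_kl(mu_p, cov_p, mu_q, cov_q):
    """KL( N(mu_p, cov_p) || N(mu_q, cov_q) ) in closed form."""
    d = mu_p.shape[0]
    cov_q_inv = np.linalg.inv(cov_q)
    diff = mu_q - mu_p
    return 0.5 * (np.trace(cov_q_inv @ cov_p)
                  + diff @ cov_q_inv @ diff
                  - d
                  + np.log(np.linalg.det(cov_q) / np.linalg.det(cov_p)))

def kl_alignment_penalty(z_src, z_tgt, eps=1e-6):
    """Estimate KL(target || source) in representation space, fitting a
    Gaussian to each domain's batch of representations (rows = samples)."""
    d = z_src.shape[1]
    mu_s, mu_t = z_src.mean(0), z_tgt.mean(0)
    cov_s = np.cov(z_src, rowvar=False) + eps * np.eye(d)  # regularize
    cov_t = np.cov(z_tgt, rowvar=False) + eps * np.eye(d)
    return gaussian_kl(mu_t, cov_t, mu_s, cov_s)
```

A training objective would then take the hedged form `cls_loss + lam * kl_alignment_penalty(z_src, z_tgt)`, where `lam` trades off alignment against source accuracy; the Gaussian assumption here is purely illustrative.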
In this paper, we analyze the generalization ability of hypotheses and learning algorithms for UDA tasks using an information-theoretic framework developed in (Russo & Zou, 2016; Xu & Raginsky, 2017). The foundation of our technique is the Donsker-Varadhan representation of the KL divergence (see Lemma A.1). We present novel upper bounds for two notions of generalization error. The first notion ("population-to-population (PP) generalization error") measures the gap between the population risk in the target domain and that in the source domain for a hypothesis, and the second ("expected empirical-to-population (EP) generalization error") measures the gap between the population risk in the target domain and the empirical risk in the source domain for a learning algorithm. We show that the PP generalization error is uniformly bounded over all hypotheses by a quantity governed by the KL divergence between the two domain distributions, which, under bounded losses, recovers the bound in Nguyen et al. (2022). We then show that this KL term upper-bounds several other measures, including the total variation distance (Ben-David et al., 2006), the Wasserstein distance (Shen et al., 2018) and the domain disagreement (Germain et al., 2020). Thus, minimizing the KL divergence forces the minimization of these other discrepancy measures as well. This, together with the ease of minimizing KL (Nguyen et al., 2022), explains the effectiveness of the KL-guided alignment approach. For the expected EP generalization error, we develop several algorithm-dependent generalization bounds. These algorithm-dependent bounds further inspire the design of two new yet simple strategies that can further boost the performance of KL-guided marginal alignment algorithms. Experiments are performed to verify the effectiveness of these strategies.
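For reference, the two standard facts invoked above can be stated as follows: the Donsker-Varadhan variational representation of the KL divergence (the basis of the analysis), and Pinsker's inequality, which is one concrete route by which the KL term dominates the total variation distance.

```latex
% Donsker--Varadhan representation (supremum over measurable g with E_Q[e^g] < \infty):
\mathrm{D}_{\mathrm{KL}}(P \,\|\, Q)
  \;=\; \sup_{g}\; \mathbb{E}_{P}[\,g\,] \;-\; \log \mathbb{E}_{Q}\!\left[e^{g}\right].

% Pinsker's inequality: KL controls total variation,
\mathrm{TV}(P, Q) \;\le\; \sqrt{\tfrac{1}{2}\,\mathrm{D}_{\mathrm{KL}}(P \,\|\, Q)}.
```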

2. RELATED WORK

Domain Adaptation Many domain adaptation generalization bounds have been developed (Ben-David et al., 2006; 2010; David et al., 2010; Mansour et al., 2009; Shen et al., 2018; Zhang et al., 2019; Germain et al., 2020; Acuna et al., 2021), and various discrepancy measures have been introduced to derive them, including total variation (Ben-David et al., 2006; 2010; David et al., 2010; Mansour et al., 2009), Wasserstein distance (Shen et al., 2018), domain disagreement (Germain et al., 2020) and so on. In particular, the H∆H-based bounds of Ben-David et al. (2010) are restricted to a binary classification setting and assume a deterministic labeling function. Furthermore, Ben-David et al. (2010) also assume that the loss is the L 1 distance between the predicted label and the true label (which is bounded). Our bounds apply to general supervised learning problems with any labeling mechanism (e.g., stochastic labeling), and we do not require a specific choice of loss (which may even be unbounded). Recently, Shui et al. (2020) proposed generalization bounds using the Jensen-Shannon (JS) divergence, which bear a relation to our Corollary 4.2. While other algorithm-dependent bounds have been proposed for different transfer learning settings (e.g., Wang et al. (2019)), they are not directly comparable to ours. For more details on domain adaptation theory, we refer readers to the comprehensive survey of Redko et al. (2020). In addition, the most common methods for domain adaptation align the marginal distributions of the representations between the source and target domains, for example, using an adversarial training mechanism (Ganin et al., 2016; Shen et al., 2018; Acuna et al., 2021) or aligning the first two moments of the representation distribution (Sun & Saenko, 2016). There are numerous other domain adaptation algorithms, and we refer readers to (Wilson & Cook, 2020; Zhou et al., 2021; Wang et al., 2021b) for recent advances.
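The moment-alignment approach mentioned above (Sun & Saenko, 2016) can be sketched, in deliberately simplified form, as a penalty on the mismatch of the first two moments of the two domains' representations; the function below is our own minimal illustration, not the original implementation.

```python
import numpy as np

def moment_alignment_loss(z_src, z_tgt):
    """Penalize the gap between the means and covariances of source and
    target representations (rows = samples), in the spirit of CORAL-style
    second-order alignment."""
    d = z_src.shape[1]
    mean_gap = np.sum((z_src.mean(0) - z_tgt.mean(0)) ** 2)
    cov_s = np.cov(z_src, rowvar=False)
    cov_t = np.cov(z_tgt, rowvar=False)
    cov_gap = np.sum((cov_s - cov_t) ** 2) / (4 * d * d)  # CORAL-style scaling
    return mean_gap + cov_gap
```

The loss vanishes when the two batches share their first two moments and grows with any mean or covariance mismatch, which is exactly the signal a moment-matching regularizer feeds back to the feature extractor.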
Information-Theoretic Generalization Bounds Information-theoretic analysis is typically used to bound the expected generalization error of supervised learning, where the training and testing data come from the same distribution (Russo & Zou, 2016; 2019; Xu & Raginsky, 2017; Bu et al., 2019; Negrea et al., 2019; Steinke & Zakynthinou, 2020; Rodríguez Gálvez et al., 2021). Exploiting the chain rule of mutual information, these bounds have been successfully applied to characterize the generalization ability of stochastic gradient based optimization algorithms (Pensia et al., 2018; Negrea et al., 2019; Haghifam et al., 2020; Wang et al., 2021a; Neu et al., 2021; Wang & Mao, 2022a; b). Recently, this framework has also been used in other learning settings, including meta-learning (Jose & Simeone, 2021a; Jose et al., 2021; Rezazadeh et al., 2021; Chen et al., 2021), semi-supervised learning (He et al., 2021; Aminian et al., 2022) and transfer learning (Wu et al., 2020; Jose & Simeone, 2021a; b; Masiha et al., 2021; Bu et al., 2022). In particular, (Wu et al., 2020; Jose & Simeone, 2021b) consider a different problem setup from ours. Specifically, their expected generalization error is the gap between the target population risk and a weighted empirical risk combining both the source and the target empirical risks, whereas our "EP" error is the gap between the target population risk and the source empirical risk. That is, we focus on the role of the unlabeled target data in cross-domain generalization when the source empirical risk is taken as the training objective, whereas their works assume the existence of labeled target data and study its role in domain adaptation.
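The prototypical bound in this line of work (Xu & Raginsky, 2017), stated in that paper's notation, bounds the expected generalization gap of an algorithm with output weights W trained on an i.i.d. sample S of n points, for a loss that is σ-sub-Gaussian under the data distribution μ:

```latex
\left| \mathbb{E}\!\left[ L_{\mu}(W) - L_{S}(W) \right] \right|
  \;\le\; \sqrt{\frac{2\sigma^{2}}{n}\, I(W; S)},
```

where \(L_{\mu}\) is the population risk, \(L_{S}\) the empirical risk, and \(I(W;S)\) the mutual information between the learned weights and the training sample.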

3. PRELIMINARY

Unless otherwise noted, a random variable will be denoted by a capitalized letter and its realization by the corresponding lower-case letter. Consider a prediction task with instance space Z = X × Y, where X and Y are the input space and the label (or output) space, respectively. Let F be the hypothesis space of interest, in which each f ∈ F is a function or predictor mapping X to Y. We assume that each hypothesis f ∈ F is parameterized by some weight parameter w in some space W and may write f as f_w as needed.




