UNIFIED PRINCIPLES FOR MULTI-SOURCE TRANSFER LEARNING UNDER LABEL SHIFTS

Abstract

We study the label shift problem in multi-source transfer learning and derive new generic principles. Our proposed framework unifies the principles of conditional feature alignment, label distribution ratio estimation and domain relation weight estimation. Based on these principles, we provide a unified practical framework for three multi-source label shift transfer scenarios: learning with limited target data, unsupervised domain adaptation and label-partial unsupervised domain adaptation. We evaluate the proposed method on these scenarios with extensive experiments and show that our proposed algorithm significantly outperforms the baselines.

1. INTRODUCTION

Transfer learning (Pan & Yang, 2009) is based on the motivation that learning a new task is easier after having learned several similar tasks. By learning the inductive bias from a set of related source domains (S_1, ..., S_T) and then leveraging the shared knowledge upon learning the target domain T, the prediction performance can be significantly improved. Based on this, transfer learning arises in deep learning applications such as computer vision (Zhang et al., 2019; Tan et al., 2018; Hoffman et al., 2018b), natural language processing (Ruder et al., 2019; Houlsby et al., 2019) and biomedical engineering (Raghu et al., 2019; Lundervold & Lundervold, 2019; Zhang & An, 2017). To ensure a reliable transfer, it is critical to understand the theoretical assumptions about the relations between domains. One implicit assumption in most transfer learning algorithms is that the label proportions remain unchanged across domains (Du Plessis & Sugiyama, 2014), i.e., S(y) = T(y). However, in many real-world applications the label distributions can vary markedly (i.e., label shift) (Wen et al., 2014; Lipton et al., 2018; Li et al., 2019b), in which case existing approaches cannot guarantee a small target generalization error, as recently proved by Combes et al. (2020). Moreover, transfer learning becomes more challenging when transferring knowledge from multiple sources to build a model for the target domain, as this requires effectively selecting and leveraging the most useful source domains when label shift occurs. This is not only theoretically interesting but also commonly encountered in real-world applications. For example, in medical diagnostics, the disease distribution changes across countries (Liu et al., 2004; Geiss et al., 2014). Considering the task of diagnosing a disease in a country without sufficient data, how can we leverage the information from different countries with abundant data to help the diagnosis?
Obviously, naïvely combining all the sources and applying a one-to-one single-source transfer learning algorithm can lead to undesirable results, as it can include low-quality or even untrusted data from certain sources, which can severely degrade the performance. In this paper, we study the label shift problem in multi-source transfer learning, where S_t(y) ≠ T(y). We propose unified principles that are applicable to three common transfer scenarios: unsupervised Domain Adaptation (DA) (Ben-David et al., 2010), limited target labels (Mansour et al., 2020) and partial unsupervised DA with supp(T(y)) ⊆ supp(S_t(y)) (Cao et al., 2018), whereas prior works generally treated them as separate scenarios. It should be noted that this work deals with target shift without assuming that the semantic conditional distributions are identical (i.e., without assuming S_t(x|y) = T(x|y)), which is more realistic for real-world problems. Our contributions in this paper are two-fold: (I) We propose to use the Wasserstein distance (Arjovsky et al., 2017) to develop a new target generalization risk upper bound (Theorem 1), which reveals the importance of label distribution ratio estimation and provides a principled guideline for learning the domain relation coefficients. Moreover, we provide a theoretical analysis in the context of representation learning (Theorem 2), which guides learning a feature function that minimizes the conditional Wasserstein distance as well as controls the weighted source risk. We further reveal that the relations among the aforementioned three scenarios lie in the different assumptions used for estimating the label distribution ratio. (II) Inspired by the theoretical results, we propose the Wasserstein Aggregation Domain Network (WADN) for handling label shift in multi-source transfer learning. We evaluate our algorithm on three benchmark datasets, and the results show that our algorithm can significantly outperform state-of-the-art principled approaches.

2. RELATED WORK

Multi-Source Transfer Learning Theories have been investigated in the previous literature with different principles for aggregating source domains. In the popular unsupervised DA setting, (Zhao et al., 2018; Peng et al., 2019; Wen et al., 2020; Li et al., 2018b) adopted the H-divergence (Ben-David et al., 2007), discrepancy (Mansour et al., 2009) and Wasserstein distance (Arjovsky et al., 2017) of the marginal distributions d(S_t(x), T(x)) to estimate domain relations and dynamically leverage different domains. The bounds in these algorithms generally consist of the source risk, the domain discrepancy, and an unobservable term η, the optimal risk on all the domains, which is ignored by these approaches. However, as Combes et al. (2020) pointed out, ignoring the influence of η is problematic when the label distributions of the source and target domains are significantly different; it is therefore necessary to take η into consideration when a small amount of labelled target data is available (Wen et al., 2020). Following this line, very recent works (Konstantinov & Lampert, 2019; Wang et al., 2019a; Mansour et al., 2020) started to measure the divergence between two domains given label information for the target domain by using the Y-discrepancy (Mohri & Medina, 2012). However, we empirically show that these methods are still unable to handle label shift.

Label-Shift. Label shift (Zhang et al., 2013; Gong et al., 2016) is a common phenomenon in transfer learning, with S(y) ≠ T(y), and is generally ignored by previous multi-source transfer learning practice. Several theoretically principled approaches have been proposed, e.g., (Azizzadenesheli et al., 2019; Garg et al., 2020). In addition, (Combes et al., 2020; Wu et al., 2019) analyzed the generalized label shift problem in the one-to-one single-source unsupervised DA problem but did not provide guidelines for leveraging different sources to ensure a reliable transfer, which is more challenging.
(Redko et al., 2019) proposed an optimal transport strategy for multi-source unsupervised DA under label shift by assuming identical semantic conditional distributions. However, they did not consider representation learning in conjunction with their framework and did not design neural-network-based approaches. Different from these, we analyze our problem in the context of representation learning and propose efficient and principled strategies. Moreover, our theoretical results highlight the importance of the label shift problem in a variety of multi-source transfer problems, while the aforementioned works generally focus on the unsupervised DA problem without considering unified rules for different scenarios (e.g., partial multi-source DA).

3. THEORETICAL INSIGHTS: TRANSFER RISK UPPER BOUND

We assume a scoring hypothesis defined on the input space X and output space Y, h : X × Y → R, that is K-Lipschitz w.r.t. the feature x (given the same label), i.e., for all y, \|h(x_1, y) - h(x_2, y)\|_2 \le K \|x_1 - x_2\|_2; and a loss function \ell : R × R → R_+ that is positive, L-Lipschitz and upper bounded by L_{max}. We denote the expected risk w.r.t. distribution D as R_D(h) = \mathbb{E}_{(x,y)\sim D}\, \ell(h(x,y)) and its empirical counterpart (w.r.t. the sample \hat{D}) as \hat{R}_{\hat{D}}(h) = \frac{1}{|\hat{D}|} \sum_{(x,y)\in \hat{D}} \ell(h(x,y)). We adopt the Wasserstein-1 distance (Arjovsky et al., 2017) as the metric to measure the similarity of the domains. Compared with other divergences, the Wasserstein distance has been proved theoretically tighter than the TV distance (Gong et al., 2016) or the Jensen-Shannon divergence (Combes et al., 2020).

Based on previous work, label shift is generally handled by the label-distribution-ratio-weighted loss: R^{\alpha}_{S}(h) = \mathbb{E}_{(x,y)\sim S}\, \alpha(y)\, \ell(h(x,y)) with \alpha(y) = T(y)/S(y). We denote \hat{\alpha}_t as its empirical counterpart, estimated from samples. Besides, to measure the task relations, we define a simplex \lambda with \lambda[t] \ge 0, \sum_{t=1}^{T} \lambda[t] = 1 as the task relation coefficient vector, which assigns high weights to similar tasks. We first present Theorem 1, which gives theoretical insights about how to combine source domains through properly estimating \lambda.

Theorem 1. Let \{\hat{S}_t = \{(x_i, y_i)\}_{i=1}^{N_{S_t}}\}_{t=1}^{T} and \hat{T} = \{(x_i, y_i)\}_{i=1}^{N_T} respectively be T source and target i.i.d. samples. For every h \in H, with H the hypothesis family, and every simplex \lambda, with probability at least 1 - 4\delta, the target risk can be upper bounded by:

R_T(h) \le \sum_t \lambda[t]\, \hat{R}^{\hat{\alpha}_t}_{\hat{S}_t}(h) + LK \sum_t \lambda[t]\, \mathbb{E}_{y \sim \hat{T}(y)} W_1\big(\hat{T}(x|Y=y) \,\|\, \hat{S}_t(x|Y=y)\big) + L_{max}\, d^{sup}_{\infty} \sqrt{\sum_{t=1}^{T} \frac{\lambda[t]^2}{\beta_t}} \sqrt{\frac{\log(1/\delta)}{2N}} + L_{max} \sup_t \|\alpha_t - \hat{\alpha}_t\|_2 + \mathrm{Comp}(N_{S_1}, \dots, N_{S_T}, N_T, \delta),

where N = \sum_{t=1}^{T} N_{S_t}, \beta_t = N_{S_t}/N, and d^{sup}_{\infty} = \max_{t \in [1,T],\, y \in [1,|Y|]} \alpha_t(y) is the maximum true label distribution ratio value. W_1(\cdot \| \cdot) is the Wasserstein-1 distance with the L_2 distance as the cost function, and \mathrm{Comp}(N_{S_1}, \dots, N_{S_T}, N_T, \delta) is a function that decreases with larger N_{S_1}, \dots, N_T, given a fixed \delta and hypothesis family H. (See Appendix E for details.)

Remarks. (1) In the first two terms, the relation coefficient \lambda is controlled by the \hat{\alpha}_t-weighted loss \hat{R}^{\hat{\alpha}_t}_{\hat{S}_t}(h) and the conditional Wasserstein distance \mathbb{E}_{y \sim \hat{T}(y)} W_1(\hat{T}(x|Y=y) \| \hat{S}_t(x|Y=y)). To minimize the upper bound, we need to assign a higher \lambda[t] to the source t with a smaller weighted prediction loss and a smaller weighted semantic conditional Wasserstein distance. Intuitively, we tend to leverage the source task that is semantically similar to the target and easy to learn. (2) If each source has an equal number of observations, i.e., \beta_t = 1/T, then the third term reduces to an L_2-norm regularization on \lambda, which can be viewed as encouraging uniformly leveraging all the sources. Combining these terms, we need to trade off assigning a higher \lambda[t] to the source t with a smaller weighted prediction loss and conditional Wasserstein distance against keeping \lambda balanced, to avoid concentrating on only one source. (3) \|\alpha_t - \hat{\alpha}_t\|_2 indicates the gap between the ground-truth and empirical label ratios; if we can estimate a good \hat{\alpha}_t, this term is small. In practice, if target labels are available, \hat{\alpha}_t can be computed from the observed data and \hat{\alpha}_t \to \alpha_t. If target labels are absent (unsupervised DA), we need to design methods to properly estimate \hat{\alpha}_t (Sec. 4). (4) \mathrm{Comp}(N_{S_1}, \dots, N_{S_T}, N_T, \delta) reflects the convergence behavior and decreases with larger observation numbers; if we fix H, \delta, N and N_T, this term can be viewed as a constant.
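As a toy illustration of the conditional term E_{y ∼ T̂(y)} W_1(T̂(x|y) ‖ Ŝ_t(x|y)) in the bound, the per-class Wasserstein-1 distance can be estimated from samples and weighted by the target label distribution. The sketch below uses synthetic 1-D features purely for illustration (the paper's cost function is the L_2 distance on higher-dimensional x; the data and variable names here are our own assumptions):

```python
import numpy as np
from scipy.stats import wasserstein_distance

rng = np.random.default_rng(0)

# Hypothetical 1-D features per class for one source domain and the target.
src = {0: rng.normal(0.0, 1.0, 500), 1: rng.normal(3.0, 1.0, 500)}
tgt = {0: rng.normal(0.5, 1.0, 500), 1: rng.normal(3.0, 1.0, 500)}
tgt_label_dist = {0: 0.7, 1: 0.3}  # empirical target label distribution T(y)

# E_{y ~ T(y)} W1(T(x|y) || S_t(x|y)): per-class W1, weighted by T(y).
cond_w1 = sum(tgt_label_dist[y] * wasserstein_distance(tgt[y], src[y])
              for y in (0, 1))
print(cond_w1)
```

Class 0 is shifted between the two domains while class 1 matches, so the weighted sum is dominated by the class-0 discrepancy, mirroring how the bound penalizes sources that disagree on the classes the target cares about.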

Insights in Representation Learning

Apart from Theorem 1, we propose a novel theoretical analysis in the context of representation learning, which motivates practical guidelines in the deep learning regime. We define a stochastic feature function g and denote its conditional distribution w.r.t. the latent variable Z (induced by g) as S(z|Y=y) = \int_x g(z|x)\, S(x|Y=y)\, dx. Then we have:

Theorem 2. We assume the same settings of the loss and the hypothesis as in Theorem 1. We further denote the stochastic feature learning function g : X → Z and the hypothesis h : Z × Y → R. Then for every simplex \lambda, the target risk is upper bounded by:

R_T(h, g) \le \sum_t \lambda[t]\, R^{\alpha_t}_{S_t}(h, g) + LK \sum_t \lambda[t]\, \mathbb{E}_{y \sim T(y)} W_1\big(S_t(z|Y=y) \,\|\, T(z|Y=y)\big),

where R_T(h, g) = \mathbb{E}_{(x,y)\sim T(x,y)} \mathbb{E}_{z \sim g(z|x)}\, \ell(h(z, y)).

Theorem 2 reveals that to control the upper bound, we need to learn a g that minimizes the weighted conditional Wasserstein distance and learn (g, h) that minimizes the weighted source risk.

Comparison with previous theorems. Our theory offers an alternative perspective for understanding transfer learning. The first term is the \alpha-weighted loss, which recovers the typical source loss minimization if there is no label shift, i.e., \alpha_t(y) \equiv 1 (Li et al., 2019a; Peng et al., 2019; Zhao et al., 2018; Wen et al., 2020). Besides, minimizing the conditional Wasserstein distances has been shown to be advantageous compared with minimizing the marginal distance W_1(S_t(z) \| T(z)) (Long et al., 2018). Moreover, Theorem 2 explicitly provides theoretical insights about the representation learning function g, which remains elusive in previous multi-source transfer theories such as (Wang et al., 2019a; Mansour et al., 2020; Konstantinov & Lampert, 2019; Li et al., 2019a; Peng et al., 2019). The theoretical results in Section 3 motivate general principles to follow when designing multi-source transfer learning algorithms. We summarize those principles in the following rules.
(I) Learn a g that minimizes the weighted conditional Wasserstein distance, and learn (g, h) that minimizes the \hat{\alpha}_t-weighted source risk (Sec. 4.1).
(II) Properly estimate the label distribution ratio \hat{\alpha}_t (Sec. 4.2).
(III) Balance the trade-off between assigning a higher \lambda[t] to the source t that has a smaller weighted prediction loss and conditional Wasserstein distance, and keeping \lambda[t] balanced (Sec. 4.3).
We instantiate these rules with a unified practical framework for solving multi-source transfer learning problems, as shown in Tab. 1. We would like to point out that our original theoretical results are based on the setting with available target labels; the proposed algorithm can be applied to unsupervised scenarios under additional assumptions.

4.1. GUIDELINES IN THE REPRESENTATION LEARNING

Motivated by Theorem 2, given a fixed label ratio estimation \hat{\alpha}_t and a fixed \lambda, we should find a representation function g : X → Z and a hypothesis function h : Z × Y → R such that:

\min_{g,h} \sum_t \lambda[t]\, \hat{R}^{\hat{\alpha}_t}_{\hat{S}_t}(h, g) + C_0 \sum_t \lambda[t]\, \mathbb{E}_{y \sim \hat{T}(y)} W_1\big(\hat{S}_t(z|Y=y) \,\|\, \hat{T}(z|Y=y)\big)

Explicit Conditional Loss. When target label information is available, one can explicitly solve the conditional optimal transport problem with g and h for a given Y = y. However, due to the high computational complexity of solving T × |Y| optimal transport problems, the original form is practically intractable. To address this issue, we propose to approximate the conditional distributions on the latent space Z as Gaussian distributions with an identical covariance matrix, such that \hat{S}_t(z|Y=y) \approx N(C^y_t, \Sigma) and \hat{T}(z|Y=y) \approx N(C^y, \Sigma). Then we have W_1(\hat{S}_t(z|Y=y) \,\|\, \hat{T}(z|Y=y)) \le \|C^y_t - C^y\|_2 (see Appendix G for details). Intuitively, this approximation is equivalent to the well-known feature mean matching (Sugiyama & Kawanabe, 2012), which computes the feature centroid of each class (on the latent space Z) and aligns the centroids by minimizing their L_2 distance.

Implicit Conditional Loss. When target label information is not available (e.g., unsupervised DA and partial DA), the explicit matching approach can adopt pseudo-labels predicted by the hypothesis h as a surrogate of the true target labels. However, in the early stage of the learning process the pseudo-labels can be unreliable, which leads to an inaccurate estimate of W_1(\hat{S}_t(z|Y=y) \,\|\, \hat{T}(z|Y=y)). To address this, the following lemma indicates that estimating the conditional Wasserstein distance is equivalent to estimating a Wasserstein adversarial loss weighted by the label distribution ratio.

Lemma 1. The weighted conditional Wasserstein distance can be implicitly expressed as:

\sum_t \lambda[t]\, \mathbb{E}_{y \sim T(y)} W_1\big(S_t(z|Y=y) \,\|\, T(z|Y=y)\big) = \max_{d_1, \dots, d_T} \sum_t \lambda[t] \Big[\mathbb{E}_{z \sim S_t(z)}\, \bar{\alpha}_t(z)\, d_t(z) - \mathbb{E}_{z \sim T(z)}\, d_t(z)\Big],

where \bar{\alpha}_t(z) = \alpha_t(y) for (z, y) \sim S_t, and d_1, \dots, d_T : Z \to R_+ are 1-Lipschitz domain discriminators (Ganin et al., 2016; Arjovsky et al., 2017).

Lemma 1 reveals that instead of using pseudo-labels to estimate the weighted conditional Wasserstein distance, one can train T domain discriminators with a weighted Wasserstein adversarial loss, which does not require the pseudo-label of each target sample during the matching. On the other hand, \bar{\alpha}_t can be obtained from \hat{\alpha}_t, which will be elaborated in Sec. 4.2. In practice, we adopt a hybrid approach that linearly combines the explicit and implicit matching strategies for all the scenarios; our empirical results show its effectiveness.
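The explicit conditional loss above reduces to feature mean matching on the latent space. A minimal PyTorch sketch follows (the function names and batch layout are our own illustration, not the paper's released code): it computes the per-class centroids C_t^y and C^y and returns their T̂(y)-weighted L_2 distance.

```python
import torch

def class_centroids(z, y, num_classes):
    # Mean latent feature per class; classes absent from the batch get zeros.
    cents = torch.zeros(num_classes, z.shape[1])
    for c in range(num_classes):
        mask = (y == c)
        if mask.any():
            cents[c] = z[mask].mean(dim=0)
    return cents

def explicit_conditional_loss(z_src, y_src, z_tgt, y_tgt, tgt_label_dist, num_classes):
    # Under the shared-covariance Gaussian approximation,
    # W1(S_t(z|y) || T(z|y)) is upper bounded by ||C_t^y - C^y||_2,
    # so the weighted conditional W1 term becomes a centroid-matching loss.
    c_src = class_centroids(z_src, y_src, num_classes)
    c_tgt = class_centroids(z_tgt, y_tgt, num_classes)
    dists = (c_src - c_tgt).norm(dim=1)
    return (tgt_label_dist * dists).sum()
```

In the unsupervised scenarios, `y_tgt` would be replaced by pseudo-labels (explicit branch) or the whole term by the adversarial loss of Lemma 1 (implicit branch).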

4.2. ESTIMATE LABEL DISTRIBUTION RATIO αt

Multi-Source Transfer with target labels. When the target labels are available, \hat{\alpha}_t can be directly estimated from the data without any assumption, and \hat{\alpha}_t \to \alpha_t follows from asymptotic statistics.

Unsupervised Multi-Source DA. In this scenario, it is impossible to estimate a good \hat{\alpha}_t without imposing additional assumptions. Following (Zhang et al., 2013; Lipton et al., 2018; Azizzadenesheli et al., 2019; Combes et al., 2020), we assume that the conditional distributions are aligned between the target and source domains (i.e., S_t(z|y) = T(z|y)). We denote \hat{S}_t(y) and \hat{T}(y) as the predicted t-source/target label distributions through the hypothesis h, and define the t-source prediction confusion matrix as

C_{\hat{S}_t}[y, k] = \hat{S}_t\big[\arg\max_{y'} h(z, y') = y,\, Y = k\big].

We can show that if the conditional distributions are aligned, then \hat{T}(y) = \hat{T}_{\hat{\alpha}_t}(y), where \hat{T}_{\hat{\alpha}_t}(Y=y) = \sum_{k=1}^{|Y|} C_{\hat{S}_t}[y, k]\, \hat{\alpha}_t(k) is the target prediction distribution constructed from the t-source information. (See Appendix I for the proof.) Then we can estimate \hat{\alpha}_t by matching these two distributions, i.e., minimizing D_{KL}(\hat{T}(y) \,\|\, \hat{T}_{\hat{\alpha}_t}(y)), which is equivalent to:

\min_{\hat{\alpha}_t}\; -\sum_{y=1}^{|Y|} \hat{T}(y) \log\Big(\sum_{k=1}^{|Y|} C_{\hat{S}_t}[y, k]\, \hat{\alpha}_t(k)\Big) \quad \text{s.t.}\;\; \forall y \in Y,\; \hat{\alpha}_t(y) \ge 0,\; \sum_{y=1}^{|Y|} \hat{\alpha}_t(y)\, \hat{S}_t(y) = 1 \qquad (2)

In the above, we have assumed that the conditional distributions are aligned, which is a feasible requirement in our algorithm, since the goal of g is exactly to gradually achieve this. In the experiments, we iteratively estimate \hat{\alpha}_t and learn g.

Unsupervised Multi-Source Partial DA. When supp(T(y)) ⊆ supp(S_t(y)), \alpha_t is sparse due to the non-overlapping classes. Accordingly, in addition to the assumption S_t(z|y) = T(z|y) as in unsupervised DA, we impose this prior knowledge by adding a regularizer \|\hat{\alpha}_t\|_1 to the objective of Eq. (2) to induce sparsity in \hat{\alpha}_t (see Appendix J for more details).
When training the neural network, since the non-overlapping classes are automatically assigned a small or zero \hat{\alpha}_t, (g, h) is less affected by the classes with small \hat{\alpha}_t. Our empirical results validate its capability in detecting non-overlapping classes and show significant improvements over the baselines.

4.3. ESTIMATE TASK RELATION COEFFICIENT λ

Inspired by Theorem 1, given fixed \hat{\alpha}_t and (g, h), we estimate \lambda by optimizing the derived upper bound:

\min_{\lambda} \sum_t \lambda[t]\, \hat{R}^{\hat{\alpha}_t}_{\hat{S}_t}(h, g) + C_0 \sum_t \lambda[t]\, \mathbb{E}_{y \sim \hat{T}(y)} W_1\big(\hat{T}(z|Y=y) \,\|\, \hat{S}_t(z|Y=y)\big) + C_1 \sum_{t=1}^{T} \frac{\lambda^2[t]}{\beta_t} \quad \text{s.t.}\;\; \forall t,\; \lambda[t] \ge 0,\; \sum_{t=1}^{T} \lambda[t] = 1

In practice, \hat{R}^{\hat{\alpha}_t}_{\hat{S}_t}(h, g) is the weighted empirical prediction error, and \mathbb{E}_{y \sim \hat{T}(y)} W_1(\hat{T}(z|Y=y) \,\|\, \hat{S}_t(z|Y=y)) is approximated by the dynamic feature centroid distance \sum_y \hat{T}(y)\, \|C^y_t - C^y\|_2 (see Appendix L for details). Thus, solving \lambda is a standard convex optimization problem.

4.4. ALGORITHM DESCRIPTION

Based on the aforementioned components, we present the description of WADN (Algorithm 1) in the unsupervised scenarios (UDA and partial DA), which iteratively updates (g, h), \hat{\alpha}_t, and \lambda.

Algorithm 1 (sketch; \epsilon_1 = 0.7 is the moving-average coefficient and \epsilon the explicit/implicit mixing weight):
For each mini-batch:
4: Compute the (un-normalized) source confusion matrix for each batch: C_{\hat{S}_t} = \#[\arg\max_{y'} h(z, y') = y, Y = k] (t = 1, \dots, T).
5: Compute the batched class centroids for the source, C^y_t, and the target, C^y.
6-8: Update the source and target class centroids by moving average: C^y_t \leftarrow \epsilon_1 C^y_t + (1 - \epsilon_1) C^y_t(\text{batch}) and C^y \leftarrow \epsilon_1 C^y + (1 - \epsilon_1) C^y(\text{batch}).
9: Update g, h, d_1, \dots, d_T (SGD and gradient reversal) by solving:
\min_{g,h} \max_{d_1,\dots,d_T} \sum_t \lambda[t]\, \hat{R}^{\hat{\alpha}_t}_{\hat{S}_t}(h, g) [classification loss] + \epsilon C_0 \sum_t \lambda[t]\, \mathbb{E}_{y \sim \hat{T}(y)} \|C^y_t - C^y\|_2 [explicit conditional loss] + (1 - \epsilon) C_0 \sum_t \lambda[t] \big[\mathbb{E}_{z \sim \hat{S}_t(z)}\, \bar{\alpha}_t(z)\, d_t(z) - \mathbb{E}_{z \sim \hat{T}(z)}\, d_t(z)\big] [implicit conditional loss]
After each epoch (estimating \hat{\alpha}_t and \lambda):
12: Compute the global (normalized) source confusion matrix C_{\hat{S}_t} = \hat{S}_t[\arg\max_{y'} h(z, y') = y, Y = k] (t = 1, \dots, T).
13: Solve \alpha'_t by Sec. 4.2 (unsupervised DA or partial DA).
14: Update \hat{\alpha}_t by moving average: \hat{\alpha}_t \leftarrow \epsilon_1 \hat{\alpha}_t + (1 - \epsilon_1) \alpha'_t.
15: Compute the weighted loss and weighted centroid distance, solve \lambda' by Sec. 4.3, and update \lambda by moving average: \lambda \leftarrow \epsilon_1 \lambda + (1 - \epsilon_1) \lambda'.

When updating \lambda and \hat{\alpha}_t, we use the CVXPY package to optimize the two standard convex objectives after each training epoch, and then update them by moving average. For WADN with target label information, we do not require pseudo-labels and directly compute \hat{\alpha}_t, as shown in Appendix L.
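The epoch-level smoothing in steps 14-15 is a plain exponential moving average; the one-liner below sketches it (assuming NumPy arrays for α̂_t and λ; the helper name is ours):

```python
import numpy as np

def moving_average(old, new, eps=0.7):
    # Algorithm-1-style smoothing: keep eps of the running estimate and
    # blend in (1 - eps) of the freshly solved value (alpha_t or lambda).
    return eps * np.asarray(old) + (1.0 - eps) * np.asarray(new)
```

Smoothing the convex-solver outputs across epochs damps the noise from mini-batch statistics and unreliable early-stage pseudo-labels.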

5. EXPERIMENTS

In this section, we compare the proposed approach with several baselines on popular tasks. For all the scenarios, the following baselines are evaluated: (I) Source: applies only the labelled source data to train the model. (II) DANN (Ganin et al., 2016): we follow the protocol of Wen et al. (2020) to merge all the source datasets into a global source domain. (III) MDAN (Zhao et al., 2018); (IV) MDMN (Li et al., 2018b); (V) M^3SDA (Peng et al., 2019), which adopts maximizing classifier discrepancy (Saito et al., 2018); and (VI) DARN (Wen et al., 2020). For conventional multi-source transfer and partial unsupervised multi-source DA, we additionally compare scenario-specific baselines. All the baselines are re-implemented with the same network structure for fair comparisons. The detailed network structures, hyper-parameter settings, and training details are provided in Appendix M.

We evaluate the performance on three different datasets: (I) Amazon Review (Blitzer et al., 2007). It contains four domains (Books, DVD, Electronics, and Kitchen) with positive and negative product reviews. We follow the common data pre-processing strategies of Chen et al. (2012) to form a 5000-dimensional bag-of-words feature. Note that the label distribution in the original dataset is uniform. To highlight the benefits of the proposed approach, we create a label-distribution-drifted task by randomly dropping 50% of the negative reviews of all the sources while keeping the target identical (shown in Fig. 3(a)). (II) Digits. It consists of four digit-recognition datasets: MNIST, USPS (Hull, 1994), SVHN (Netzer et al., 2011) and Synth (Ganin et al., 2016). We also create a slight label distribution drift for the sources by randomly dropping 50% of the samples of digits 5-9 while keeping the target identical (shown in Fig. 3(b)). (III) Office-Home (Venkateswara et al., 2017). It contains 65 classes in four different domains: Art, Clipart, Product and Real-World.
We use the ResNet50 (He et al., 2016) pretrained on ImageNet in PyTorch as the base network for feature learning, with an MLP on top for classification. The label distributions in these four domains are different, and we did not manually create a label drift (shown in Fig. 3(c)).
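The label-drift construction used for Amazon Review and Digits (dropping a fraction of certain classes from the sources while keeping the target intact) can be simulated as follows; this is a hedged sketch with a hypothetical helper name, not the paper's preprocessing script:

```python
import numpy as np

def drop_class_fraction(X, y, cls, frac, seed=0):
    # Simulate source label shift: randomly drop `frac` of the samples whose
    # label is in `cls`; the target set is left untouched elsewhere.
    rng = np.random.default_rng(seed)
    drop = np.isin(y, cls) & (rng.random(len(y)) < frac)
    keep = ~drop
    return X[keep], y[keep]

# e.g., drop 50% of digits 5-9 from one source, as in the Digits setup:
# X_src, y_src = drop_class_fraction(X_src, y_src, list(range(5, 10)), 0.5)
```

Applying this to each source but not the target produces the S_t(y) ≠ T(y) setting the experiments evaluate.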

5.1. UNSUPERVISED MULTI-SOURCE DA

In unsupervised multi-source DA, we evaluate the proposed approach on all three datasets. We use a hyper-parameter selection strategy similar to DANN (Ganin et al., 2016). All reported results are averaged over five runs. The detailed experimental settings are given in Appendix M. The empirical results are shown in Tabs. 7, 2 and 3. Since we did not change the target label distribution throughout the experiments, we still use target accuracy as the metric. We report the mean and standard deviation for each approach. The best approaches, based on a two-sided Wilcoxon signed-rank test (significance level p = 0.05), are shown in bold.

The empirical results reveal a significantly better performance (≈ 3%) on the different datasets. To understand the working principles of WADN, we evaluate its performance under different levels of source label shift on the Amazon Review dataset (Fig. 1(a)). The results show strong practical benefits for WADN as the label shift grows larger. In addition, we visualize the task relations on Digits (Fig. 1(b)) and observe a non-uniform λ, which highlights the importance of properly choosing the most related sources rather than simply merging all the data. For example, when the target domain is SVHN, WADN mainly leverages the information from Synth, since they are more semantically similar, while MNIST does not help much for SVHN (as observed by Ganin et al. (2016)). Additional analysis and results can be found in Appendix O.

5.2. MULTI-SOURCE TRANSFER LEARNING WITH LIMITED TARGET SAMPLES

We adopt Amazon Review and Digits, which have been widely used, for multi-source transfer learning with limited target samples, and use the same hyper-parameters and training strategies as in unsupervised DA. We additionally add the recent baselines RLUS (Konstantinov & Lampert, 2019) and MME (Saito et al., 2019), which also consider transfer learning with labeled target data. The results are reported in Tabs. 4 and 5 and also indicate strong empirical benefits. To show the effectiveness of WADN, we select various portions of labelled samples (1% to 10%) on the target. The results on the USPS dataset in Fig. 1(c) show consistently better performance than the baselines, even with few target samples.

5.3. PARTIAL UNSUPERVISED MULTI-SOURCE DA

In this scenario, we adopt the Office-Home dataset to evaluate our approach, as it contains a large number of classes (65). We do not change the source domains, and we randomly choose 35 classes for the target. We evaluate all the baselines on the same selected classes and repeat this 5 times. All reported results are averaged over 3 different sub-class selections (15 runs in total), as shown in Tab. 6 (see Appendix M for details). We additionally compare with PADA (Cao et al., 2018) by merging all the sources and applying the one-to-one partial DA algorithm. We adopt the same hyper-parameters and training strategies as in the unsupervised DA scenario.

The reported results are significantly better than the current multi-source DA and one-to-one partial DA approaches, which verifies the benefits of WADN: properly estimating \hat{\alpha}_t and assigning a proper \lambda to each source.

(Figure: panels Alpha(A), Alpha(C), Alpha(R); plot content not recoverable from the extraction.)

6. CONCLUSION

In this paper, we proposed a new theoretically principled algorithm, WADN (Wasserstein Aggregation Domain Network), to solve the multi-source transfer learning problem under target shift. WADN provides a unified solution for various deep multi-source transfer scenarios: learning with limited target data, unsupervised DA, and partial unsupervised DA. We evaluate the proposed method with extensive experiments and show its strong empirical results.

A ADDITIONAL EMPIRICAL RESULTS

(..., 2019) proposed an ad-hoc strategy to combine the sources in the few-shot target domain setting. These ideas are generally data-driven approaches and do not analyze why the proposed practice can control the generalization error.

Label-Partial Transfer Learning. Label-partial transfer can be viewed as a special case of label shift. Most existing works focus on one-to-one partial transfer learning (Zhang et al., 2018; Chen et al., 2020; Bucci et al., 2019; Cao et al., 2019) by adopting a re-weighting training approach without a formal understanding. In our paper, we first rigorously analyze this common practice and adopt the label distribution ratio as its weights, which provides a principled approach in this scenario.

Domain Generalization

Domain generalization (DG) resembles multi-source transfer but aims at a different goal. A common setting in DG is to learn from multiple sources and directly predict on an unseen target domain. Conventional DG approaches generally learn distribution-invariant features (Balaji et al., 2018; Saenko et al., 2010; Motiian et al., 2017; Ilse et al., 2019) or conditional-distribution-invariant features (Li et al., 2018a; Akuzawa et al., 2019). However, our theoretical results reveal that in the presence of label shift (i.e., \alpha_t(y) \ne 1) and outlier tasks, learning conditional or marginal invariant features cannot guarantee a small target risk. Our theoretical result enables a formal understanding of the inherent difficulty in DG problems.

Few-Shot Learning

Few-shot learning (Finn et al., 2017; Snell et al., 2017; Sung et al., 2018) can be viewed as a very specific scenario of multi-source transfer learning. We would like to point out the differences between few-shot learning and our paper. (1) Few-shot learning generally involves a very large set of source domains (T \gg 1), with each domain consisting of a modest number of observations N_{S_t}. In our paper, we are interested in a modest number of source domains T, with each source domain containing a sufficiently large number of observations (N_{S_t} \gg 1). (2) In the target domain, the few-shot setting generally uses K samples per class (with K very small) for fine-tuning. We would like to point out that this setting generally violates our theoretical assumptions.

E PROOF OF THEOREM 1

Proof. By denoting \alpha(y) = \frac{T(y)}{S(y)}, we have:

\mathbb{E}_{y \sim T(y)} \mathbb{E}_{x \sim T(x|y)}\, \ell(h(x, y)) = \mathbb{E}_{y \sim S(y)}\, \alpha(y)\, \mathbb{E}_{x \sim T(x|y)}\, \ell(h(x, y))

We then aim to upper bound \mathbb{E}_{x \sim T(x|y)}\, \ell(h(x, y)). For any fixed y,

\mathbb{E}_{x \sim T(x|y)}\, \ell(h(x, y)) - \mathbb{E}_{x \sim S(x|y)}\, \ell(h(x, y)) \le \Big| \int_{x \in X} \ell(h(x, y))\, d\big(T(x|y) - S(x|y)\big) \Big|

According to the Kantorovich-Rubinstein duality, for any distribution coupling \gamma \in \Pi(T(x|y), S(x|y)), we have:

= \inf_{\gamma} \Big| \int_{X \times X} \big(\ell(h(x_p, y)) - \ell(h(x_q, y))\big)\, d\gamma(x_p, x_q) \Big|
\le \inf_{\gamma} \int_{X \times X} \big| \ell(h(x_p, y)) - \ell(h(x_q, y)) \big|\, d\gamma(x_p, x_q)
\le L \inf_{\gamma} \int_{X \times X} \big| h(x_p, y) - h(x_q, y) \big|\, d\gamma(x_p, x_q)
\le LK \inf_{\gamma} \int_{X \times X} \|x_p - x_q\|_2\, d\gamma(x_p, x_q)
= LK\, W_1\big(T(x|Y=y) \,\|\, S(x|Y=y)\big)

The first inequality is obvious; the second inequality comes from the assumption that \ell is L-Lipschitz; the third inequality comes from the hypothesis being K-Lipschitz w.r.t. the feature x (given the same label), i.e., for all y, \|h(x_1, y) - h(x_2, y)\|_2 \le K \|x_1 - x_2\|_2.

Then we have:

R_T(h) \le \mathbb{E}_{y \sim S(y)}\, \alpha(y) \big[\mathbb{E}_{x \sim S(x|y)}\, \ell(h(x, y)) + LK\, W_1(T(x|y) \,\|\, S(x|y))\big]
= \mathbb{E}_{(x,y) \sim S}\, \alpha(y)\, \ell(h(x, y)) + LK\, \mathbb{E}_{y \sim T(y)}\, W_1(T(x|Y=y) \,\|\, S(x|Y=y))
= R^{\alpha}_{S}(h) + LK\, \mathbb{E}_{y \sim T(y)}\, W_1(T(x|Y=y) \,\|\, S(x|Y=y))

Supposing we assign each source S_t the weight \lambda[t] and the label distribution ratio \alpha_t(y) = \frac{T(y)}{S_t(y)}, then by combining the T source-target pairs we have:

R_T(h) \le \sum_t \lambda[t]\, R^{\alpha_t}_{S_t}(h) + LK \sum_t \lambda[t]\, \mathbb{E}_{y \sim T(y)}\, W_1\big(T(x|Y=y) \,\|\, S_t(x|Y=y)\big)

Starting from this result, we next derive the non-asymptotic bound, estimated from finite sample observations. Supposing the empirical label ratio estimate is \hat{\alpha}_t, we prove the high-probability bound for any simplex \lambda.

E.1 BOUNDING THE EMPIRICAL AND EXPECTED PREDICTION RISK

Proof. We first bound the prediction risk term, which can be decomposed as:

\sup_h \Big| \sum_t \lambda[t] R^{\alpha_t}_{S_t}(h) - \sum_t \lambda[t] \hat{R}^{\hat{\alpha}_t}_{\hat{S}_t}(h) \Big| \le \underbrace{\sup_h \Big| \sum_t \lambda[t] R^{\alpha_t}_{S_t}(h) - \sum_t \lambda[t] \hat{R}^{\alpha_t}_{\hat{S}_t}(h) \Big|}_{(I)} + \underbrace{\sup_h \Big| \sum_t \lambda[t] \hat{R}^{\alpha_t}_{\hat{S}_t}(h) - \sum_t \lambda[t] \hat{R}^{\hat{\alpha}_t}_{\hat{S}_t}(h) \Big|}_{(II)}

Bounding term (I). According to the McDiarmid inequality, changing one sample changes the quantity by at most \frac{2 \lambda[t]\, \alpha_t(y)\, L_{max}}{N_{S_t}}. Then we have:

P\big((I) - \mathbb{E}(I) \ge \epsilon\big) \le \exp\Big( \frac{-2\epsilon^2}{\sum_{t=1}^{T} \frac{4}{\beta_t N} \lambda^2[t]\, \alpha_t(y)^2\, L^2_{max}} \Big) = \delta

By substituting \delta, with high probability 1 - \delta we have:

(I) \le \mathbb{E}(I) + L_{max}\, d^{sup}_{\infty} \sqrt{\sum_{t=1}^{T} \frac{\lambda[t]^2}{\beta_t}} \sqrt{\frac{\log(1/\delta)}{2N}}

where L_{max} = \sup_h \ell(h) and N = \sum_t N_{S_t}.

Bounding \mathbb{E}(I). The expectation term can be upper bounded in the form of a Rademacher complexity:

\mathbb{E}(I) \le 2\, \mathbb{E}_{\sigma} \mathbb{E}_{\hat{S}_1^T} \sup_h \sum_{t=1}^{T} \lambda[t] \sum_{(x_t, y_t) \in \hat{S}_t} \frac{1}{T N} \sigma_i\, \alpha_t(y)\, \ell(h(x_t, y_t))
\le 2 \sum_t \lambda[t]\, \mathbb{E}_{\sigma} \mathbb{E}_{\hat{S}_1^T} \sup_h \sum_{(x_t, y_t) \in \hat{S}_t} \frac{1}{T N} \sigma_i\, \alpha_t(y)\, \ell(h(x_t, y_t))
\le 2 \sup_t \mathbb{E}_{\sigma} \mathbb{E}_{\hat{S}_t} \sup_h \sum_{(x_t, y_t) \in \hat{S}_t} \frac{1}{T N} \sigma_i\, [\alpha_t(y)\, \ell(h(x_t, y_t))]
= \sup_t 2 R_t(\ell, H) = 2 R(\ell, H)

where R(\ell, H) = \sup_t R_t(\ell, H) = \sup_t \sup_{h \in H} \mathbb{E}_{\hat{S}_t, \sigma} \sum_{(x_t, y_t) \in \hat{S}_t} \frac{1}{T N} \sigma_i\, [\alpha_t(y)\, \ell(h(x_t, y_t))] represents the Rademacher complexity w.r.t. the prediction loss \ell, the hypothesis h and the true label distribution ratio \alpha_t. Therefore, with high probability 1 - \delta, we have:

\sup_h \Big| \sum_t \lambda[t] R^{\alpha_t}_{S_t}(h) - \sum_t \lambda[t] \hat{R}^{\alpha_t}_{\hat{S}_t}(h) \Big| \le 2 R(\ell, H) + L_{max}\, d^{sup}_{\infty} \sqrt{\sum_{t=1}^{T} \frac{\lambda[t]^2}{\beta_t}} \sqrt{\frac{\log(1/\delta)}{2N}}

Bounding term (II). For all hypotheses h, we have:

\Big| \sum_t \lambda[t] \hat{R}^{\alpha_t}_{\hat{S}_t}(h) - \sum_t \lambda[t] \hat{R}^{\hat{\alpha}_t}_{\hat{S}_t}(h) \Big| = \Big| \sum_t \lambda[t] \frac{1}{N_{S_t}} \sum_{i}^{N_{S_t}} \big(\alpha_t(y^{(i)}) - \hat{\alpha}_t(y^{(i)})\big)\, \ell(h) \Big| = \sum_t \lambda[t] \frac{1}{N_{S_t}} \Big| \sum_{y}^{|Y|} \big(\alpha_t(Y=y) - \hat{\alpha}_t(Y=y)\big)\, \bar{\ell}(Y=y) \Big|

where \bar{\ell}(Y=y) = \sum_{i}^{N_{S_t}} \ell(h(x_i, y_i = y)) represents the cumulative error, conditioned on a given label Y = y.
According to the Hölder inequality, we have:

Σ_t λ[t] (1/N_{S_t}) |Σ^{|Y|}_{y} (α_t(Y = y) − α̂_t(Y = y)) ℓ̄(Y = y)|
≤ Σ_t λ[t] (1/N_{S_t}) ‖α_t − α̂_t‖₂ ‖ℓ̄(Y = y)‖₂
≤ L_max Σ_t λ[t] ‖α_t − α̂_t‖₂
≤ L_max sup_t ‖α_t − α̂_t‖₂

Therefore, for all h ∈ H, with high probability 1 − δ we have:

Σ_t λ[t] R^{α_t}_{S_t}(h) ≤ Σ_t λ[t] R̂^{α̂_t}_{S_t}(h) + 2 R̄(ℓ, H) + L_max d^sup_∞ √(Σ^T_{t=1} λ[t]²/β_t) √(log(1/δ)/(2N)) + L_max sup_t ‖α_t − α̂_t‖₂

E.2 BOUNDING THE EMPIRICAL WASSERSTEIN DISTANCE

We then derive the sample complexity between the empirical and true distributions, which can be decomposed into the following two parts. For any t, we have:

E_{y∼T(y)} W₁(T(x|Y = y) ‖ S_t(x|Y = y)) − E_{y∼T̂(y)} W₁(T̂(x|Y = y) ‖ Ŝ_t(x|Y = y))
≤ [E_{y∼T(y)} W₁(T(x|Y = y) ‖ S_t(x|Y = y)) − E_{y∼T(y)} W₁(T̂(x|Y = y) ‖ Ŝ_t(x|Y = y))]  (I)
+ [E_{y∼T(y)} W₁(T̂(x|Y = y) ‖ Ŝ_t(x|Y = y)) − E_{y∼T̂(y)} W₁(T̂(x|Y = y) ‖ Ŝ_t(x|Y = y))]  (II)

Bounding (I). We have:

E_{y∼T(y)} W₁(T(x|Y = y) ‖ S_t(x|Y = y)) − E_{y∼T(y)} W₁(T̂(x|Y = y) ‖ Ŝ_t(x|Y = y))
= Σ_y T(y) [W₁(T(x|Y = y) ‖ S_t(x|Y = y)) − W₁(T̂(x|Y = y) ‖ Ŝ_t(x|Y = y))]
≤ |Σ_y T(y)| sup_y [W₁(T(x|Y = y) ‖ S_t(x|Y = y)) − W₁(T̂(x|Y = y) ‖ Ŝ_t(x|Y = y))]
= sup_y [W₁(T(x|Y = y) ‖ S_t(x|Y = y)) − W₁(T̂(x|Y = y) ‖ Ŝ_t(x|Y = y))]
≤ sup_y [W₁(S_t(x|Y = y) ‖ Ŝ_t(x|Y = y)) + W₁(Ŝ_t(x|Y = y) ‖ T̂(x|Y = y)) + W₁(T̂(x|Y = y) ‖ T(x|Y = y)) − W₁(T̂(x|Y = y) ‖ Ŝ_t(x|Y = y))]
= sup_y [W₁(S_t(x|Y = y) ‖ Ŝ_t(x|Y = y)) + W₁(T̂(x|Y = y) ‖ T(x|Y = y))]

The first inequality holds by the Hölder inequality. The second inequality uses the triangle inequality of the Wasserstein distance: W₁(P ‖ Q) ≤ W₁(P ‖ P₁) + W₁(P₁ ‖ P₂) + W₁(P₂ ‖ Q). According to the convergence behavior of the Wasserstein distance (Weed et al., 2019), with high probability ≥ 1 − 2δ we have:

W₁(S_t(x|Y = y) ‖ Ŝ_t(x|Y = y)) + W₁(T̂(x|Y = y) ‖ T(x|Y = y)) ≤ κ(δ, N^y_{S_t}, N^y_T)

where κ(δ, N^y_{S_t}, N^y_T) = C_{t,y} (N^y_{S_t})^{−s_{t,y}} + C_y (N^y_T)^{−s_y} + √((1/2) log(2/δ)) (√(1/N^y_{S_t}) + √(1/N^y_T)), with N^y_{S_t} the number of samples with Y = y in source t and N^y_T the number of samples with Y = y in the target distribution. C_{t,y}, C_y and s_{t,y} > 2, s_y > 2 are positive constants in the concentration inequality. This characterizes the convergence between the empirical and true Wasserstein distances. Adopting the union bound over all labels by setting δ ← δ/|Y|, with high probability ≥ 1 − 2δ we have:

sup_y [W₁(S_t(x|Y = y) ‖ Ŝ_t(x|Y = y)) + W₁(T̂(x|Y = y) ‖ T(x|Y = y))] ≤ κ(δ, N^y_{S_t}, N^y_T)

where κ(δ, N^y_{S_t}, N^y_T) = C_{t,y} (N^y_{S_t})^{−s_{t,y}} + C_y (N^y_T)^{−s_y} + √((1/2) log(2|Y|/δ)) (√(1/N^y_{S_t}) + √(1/N^y_T)). Again adopting the union bound over all tasks by setting δ ← δ/T, with high probability ≥ 1 − 2δ we have:

Σ_t λ[t] E_{y∼T(y)} W₁(T(x|Y = y) ‖ S_t(x|Y = y)) − Σ_t λ[t] E_{y∼T(y)} W₁(T̂(x|Y = y) ‖ Ŝ_t(x|Y = y)) ≤ sup_t κ(δ, N^y_{S_t}, N^y_T)

where κ(δ, N^y_{S_t}, N^y_T) = C_{t,y} (N^y_{S_t})^{−s_{t,y}} + C_y (N^y_T)^{−s_y} + √((1/2) log(2T|Y|/δ)) (√(1/N^y_{S_t}) + √(1/N^y_T)).

Bounding (II). We can bound the second term as:

E_{y∼T(y)} W₁(T̂(x|Y = y) ‖ Ŝ_t(x|Y = y)) − E_{y∼T̂(y)} W₁(T̂(x|Y = y) ‖ Ŝ_t(x|Y = y))
≤ sup_y W₁(T̂(x|Y = y) ‖ Ŝ_t(x|Y = y)) Σ_y |T(y) − T̂(y)|
≤ C^t_max Σ_y |T(y) − T̂(y)|

where C^t_max = sup_y W₁(T̂(x|Y = y) ‖ Ŝ_t(x|Y = y)) is a positive and bounded constant. By McDiarmid's inequality, with high probability 1 − δ:

Σ_y |T(y) − T̂(y)| ≤ E_{T̂} Σ_y |T(y) − T̂(y)| + √(log(1/δ)/(2N_T)) = 2 E_σ E_{T̂} Σ_y σ T̂(y) + √(log(1/δ)/(2N_T))

We then bound E_σ E_{T̂} Σ_y σ T̂(y) using the properties of Rademacher complexity [Lemma 26.11, (Shalev-Shwartz & Ben-David, 2014)] and the fact that T̂(y) lies on the probability simplex:

E_σ E_{T̂} Σ_y σ T̂(y) ≤ √(2 log(2|Y|)/N_T)

Hence:

Σ_y |T(y) − T̂(y)| ≤ √(2 log(2|Y|)/N_T) + √(log(1/δ)/(2N_T))

Then, using the union bound with δ ← δ/T, with high probability ≥ 1 − δ and for any simplex λ, we have:

Σ_t λ[t] E_{y∼T(y)} W₁(T̂(x|Y = y) ‖ Ŝ_t(x|Y = y)) ≤ Σ_t λ[t] E_{y∼T̂(y)} W₁(T̂(x|Y = y) ‖ Ŝ_t(x|Y = y)) + C_max (√(2 log(2|Y|)/N_T) + √(log(T/δ)/(2N_T)))

where C_max = sup_t C^t_max.
Combining the results above, we derive the PAC learning bound estimated from finite samples (with high probability 1 − 4δ):

R_T(h) ≤ Σ_t λ[t] R̂^{α̂_t}_{S_t}(h) + LK Σ_t λ[t] E_{y∼T̂(y)} W₁(T̂(x|Y = y) ‖ Ŝ_t(x|Y = y)) + L_max d^sup_∞ √(Σ^T_{t=1} λ[t]²/β_t) √(log(1/δ)/(2N)) + 2 R̄(ℓ, H) + L_max sup_t ‖α_t − α̂_t‖₂ + sup_t κ(δ, N^y_{S_t}, N^y_T) + C_max (√(2 log(2|Y|)/N_T) + √(log(T/δ)/(2N_T)))

We then denote Comp(N_{S₁}, …, N_{S_T}, N_T, δ) = 2 R̄(ℓ, H) + sup_t κ(δ, N^y_{S_t}, N^y_T) + C_max (√(2 log(2|Y|)/N_T) + √(log(T/δ)/(2N_T))) as the convergence-rate function, which decreases with larger N_{S₁}, …, N_T. Besides, R̄(ℓ, H) = sup_t R_t(ℓ, H) is the re-weighted Rademacher complexity. For a fixed hypothesis class with finite VC dimension, it can be shown that R̄(ℓ, H) = O(1/√(min_t N_{S_t})) (Shalev-Shwartz & Ben-David, 2014).
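To build intuition for how this bound tightens with more data, the finite-sample deviation term L_max d^sup_∞ √(Σ_t λ[t]²/β_t) √(log(1/δ)/(2N)) can be evaluated numerically. The sketch below is only an illustration; the constants L_max and d_sup and the simplex values are hypothetical.

```python
import math

def deviation_term(lam, beta, n_total, delta, l_max=1.0, d_sup=2.0):
    # L_max * d_sup * sqrt(sum_t lambda_t^2 / beta_t) * sqrt(log(1/delta) / (2N))
    s = sum(l * l / b for l, b in zip(lam, beta))
    return l_max * d_sup * math.sqrt(s) * math.sqrt(math.log(1.0 / delta) / (2.0 * n_total))

lam = [0.5, 0.3, 0.2]            # simplex over T = 3 sources
beta = [0.4, 0.4, 0.2]           # source frequency ratios N_St / N
dev_1k = deviation_term(lam, beta, n_total=1000, delta=0.05)
dev_100k = deviation_term(lam, beta, n_total=100000, delta=0.05)
# the deviation term shrinks at rate O(1/sqrt(N)) as the total source sample size grows
```

The O(1/√N) decay is visible directly: multiplying N by 100 divides the term by 10.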

F PROOF OF THEOREM 2

We first recall the stochastic feature representation g: X → Z, the scoring hypothesis h: Z × Y → R, and the prediction loss ℓ: R → R.³

Proof. The marginal and conditional distributions w.r.t. the latent variable Z induced by g can be formulated as:

S(z) = ∫_x g(z|x) S(x) dx,  S(z|y) = ∫_x g(z|x) S(x|Y = y) dx

In the multi-class classification problem, we additionally define the following distributions:

μ_k(z) = S(Y = k, z) = S(Y = k) S(z|Y = k),  π_k(z) = T(Y = k, z) = T(Y = k) T(z|Y = k)

Based on (Nguyen et al., 2009), since g(z|x) is a stochastic representation learning function, the loss conditioned on a fixed point (x, y) w.r.t. h and g is E_{z∼g(z|x)} ℓ(h(z, y)). Taking the expectation over S(x, y), we have:⁴

R_S(h, g) = E_{(x,y)∼S(x,y)} E_{z∼g(z|x)} ℓ(h(z, y))
= Σ^{|Y|}_{k=1} S(y = k) ∫_x S(x|Y = k) ∫_z g(z|x) ℓ(h(z, y = k)) dz dx
= Σ^{|Y|}_{k=1} S(y = k) ∫_z [∫_x S(x|Y = k) g(z|x) dx] ℓ(h(z, y = k)) dz
= Σ^{|Y|}_{k=1} S(y = k) ∫_z S(z|Y = k) ℓ(h(z, y = k)) dz
= Σ^{|Y|}_{k=1} ∫_z S(z, Y = k) ℓ(h(z, y = k)) dz
= Σ^{|Y|}_{k=1} ∫_z μ_k(z) ℓ(h(z, y = k)) dz

Intuitively, the expected loss w.r.t. the joint distribution S decomposes into the expected loss on the label distribution S(y) (weighting over the labels) and on the conditional distribution S(·|y) (a real-valued conditional loss). The expected risks on S and T can then be expressed as:

R_S(h, g) = Σ^{|Y|}_{k=1} ∫_z ℓ(h(z, y = k)) μ_k(z) dz,  R_T(h, g) = Σ^{|Y|}_{k=1} ∫_z ℓ(h(z, y = k)) π_k(z) dz

³ Note this definition differs from the conventional binary classification setting with binary output; it is more suitable for the multi-class scenario and the cross-entropy loss (Hoffman et al., 2018a). For example, if we define ℓ = −log(·) and h(z, y) ∈ (0, 1) as a scalar score output, then ℓ(h(z, y)) can be viewed as the cross-entropy loss of the neural network.

⁴ An alternative understanding is based on a Markov chain: a DAG in which Y is generated from X through S(y|x), Z is generated from X through g, and the score s is computed from (Z, Y) through h.
(Here s denotes the output of the scoring function.) The expected loss over all random variables can be equivalently written as ∫ P(x, y, z, s) ℓ(s) d(x, y, z, s) = ∫ P(x) P(y|x) P(z|x) P(s|z, y) ℓ(s) = ∫ P(x, y) P(z|x) P(s|z, y) ℓ(s) d(x, y) d(z) d(s). Since the score s is determined by h(z, y), we have P(s|y, z) = 1. By definition P(z|x) = g(z|x) and P(x, y) = S(x, y), so the loss can finally be expressed as E_{S(x,y)} E_{g(z|x)} ℓ(h(z, y)).

Denoting α(y) = T(y)/S(y), we have the α-weighted loss:

R^α_S(h, g) = T(Y = 1) ∫_z ℓ(h(z, y = 1)) S(z|Y = 1) dz + T(Y = 2) ∫_z ℓ(h(z, y = 2)) S(z|Y = 2) dz + ⋯ + T(Y = k) ∫_z ℓ(h(z, y = k)) S(z|Y = k) dz

Then we have:

R_T(h, g) − R^α_S(h, g) ≤ Σ_k T(Y = k) ∫_z ℓ(h(z, y = k)) d|S(z|Y = k) − T(z|Y = k)|

Under the same assumption, the loss function ℓ(h(z, Y = k)) is KL-Lipschitz w.r.t. the cost ‖·‖₂ (for a fixed k). Therefore, by the same proof strategy as in Lemma 2 (Kantorovich-Rubinstein duality), we have:

R_T(h, g) − R^α_S(h, g) ≤ KL T(Y = 1) W₁(S(z|Y = 1) ‖ T(z|Y = 1)) + ⋯ + KL T(Y = k) W₁(S(z|Y = k) ‖ T(z|Y = k)) = KL E_{y∼T(y)} W₁(S(z|Y = y) ‖ T(z|Y = y))

Therefore:

R_T(h, g) ≤ R^α_S(h, g) + LK E_{y∼T(y)} W₁(S(z|Y = y) ‖ T(z|Y = y))

Based on this result, for each t = 1, …, T, setting S = S_t and α(y) = α_t(y) = T(y)/S_t(y):

λ[t] R_T(h, g) ≤ λ[t] R^{α_t}_{S_t}(h, g) + LK λ[t] E_{y∼T(y)} W₁(S_t(z|Y = y) ‖ T(z|Y = y))

Summing over t = 1, …, T, we have:

R_T(h, g) ≤ Σ^T_{t=1} λ[t] R^{α_t}_{S_t}(h, g) + LK Σ^T_{t=1} λ[t] E_{y∼T(y)} W₁(S_t(z|Y = y) ‖ T(z|Y = y))

G APPROXIMATION OF THE W₁ DISTANCE

According to Jensen's inequality, we have W₁(Ŝ_t(z|Y = y) ‖ T̂(z|Y = y)) ≤ √([W₂(Ŝ_t(z|Y = y) ‖ T̂(z|Y = y))]²). Supposing Ŝ_t(z|Y = y) ≈ N(C^y_t, Σ) and T̂(z|Y = y) ≈ N(C^y, Σ), we have:

[W₂(Ŝ_t(z|Y = y) ‖ T̂(z|Y = y))]² = ‖C^y_t − C^y‖²₂ + Trace(2Σ − 2(ΣΣ)^{1/2}) = ‖C^y_t − C^y‖²₂

We would like to point out that assuming an identical covariance matrix is more computationally efficient during the matching.
This is advantageous and reasonable in the deep-learning regime: we adopt mini-batches (ranging from 20 to 128 samples) for optimizing the neural network parameters, so each mini-batch contains only a few samples per class. The empirical covariance/variance matrix computed on such batches would be strongly biased relative to the ground-truth covariance and would induce a much higher optimization complexity. By contrast, the empirical mean is unbiased and computationally efficient: we can simply use a moving average to efficiently update the estimated mean value. The empirical results verify the effectiveness of this idea.
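As a numerical check of the closed-form Gaussian W₂ expression above, the sketch below (an illustration only, not the paper's implementation) computes the squared W₂ distance between two Gaussians and confirms that with a shared covariance it reduces to the squared centroid distance ‖C^y_t − C^y‖²₂:

```python
import numpy as np

def _sqrtm_psd(mat):
    # Matrix square root of a symmetric PSD matrix via eigendecomposition.
    vals, vecs = np.linalg.eigh(mat)
    vals = np.clip(vals, 0.0, None)
    return (vecs * np.sqrt(vals)) @ vecs.T

def w2_squared_gaussian(mu1, cov1, mu2, cov2):
    # Closed-form squared W2 between N(mu1, cov1) and N(mu2, cov2):
    # ||mu1 - mu2||^2 + Tr(cov1 + cov2 - 2 (cov2^{1/2} cov1 cov2^{1/2})^{1/2})
    s2 = _sqrtm_psd(cov2)
    cross = _sqrtm_psd(s2 @ cov1 @ s2)
    return float(np.sum((mu1 - mu2) ** 2) + np.trace(cov1 + cov2 - 2.0 * cross))

# With a shared covariance the trace term vanishes, leaving only the
# squared centroid distance between the class means.
sigma = np.array([[2.0, 0.3], [0.3, 1.0]])
c_src, c_tgt = np.array([1.0, 0.0]), np.array([0.0, 2.0])
d2 = w2_squared_gaussian(c_src, sigma, c_tgt, sigma)
# d2 == ||c_src - c_tgt||^2 = 5
```

With different covariances the trace term would be non-zero, which is exactly the estimation burden the identical-covariance assumption avoids.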

H PROOF OF LEMMA 1

For each source S_t, introducing the duality of the Wasserstein-1 distance, for y ∈ Y we have:

W₁(S_t(z|y) ‖ T(z|y)) = sup_{‖d‖_L ≤ 1} E_{z∼S_t(z|y)} d(z) − E_{z∼T(z|y)} d(z)
= sup_{‖d‖_L ≤ 1} ∫_z S_t(z|y) d(z) − ∫_z T(z|y) d(z)
= (1/T(y)) sup_{‖d‖_L ≤ 1} [(T(y)/S_t(y)) ∫_z S_t(z, y) d(z) − ∫_z T(z, y) d(z)]

Then, defining ᾱ_t(z) = 1{(z,y)∼S_t} T(Y = y)/S_t(Y = y) = 1{(z,y)∼S_t} α_t(Y = y), we see that for each pair (z, y) observed from the same distribution, ᾱ_t(Z = z) = α_t(Y = y). Then we have:

Σ_y T(y) W₁(S_t(z|y) ‖ T(z|y)) = Σ_y sup_{‖d‖_L ≤ 1} {∫_z α_t(y) S_t(z, y) d(z) − ∫_z T(z, y) d(z)}
= sup_{‖d‖_L ≤ 1} ∫_z ᾱ_t(z) S_t(z) d(z) − ∫_z T(z) d(z)
= sup_{‖d‖_L ≤ 1} E_{z∼S_t(z)} ᾱ_t(z) d(z) − E_{z∼T(z)} d(z)

A simple example helps to understand ᾱ_t: suppose three samples S_t = {(z₁, Y = 1), (z₂, Y = 1), (z₃, Y = 0)}; then ᾱ_t(z₁) = ᾱ_t(z₂) = α_t(1) and ᾱ_t(z₃) = α_t(0). Therefore, the conditional term is equivalent to label-weighted Wasserstein adversarial learning. Plugging in each source-domain weight λ[t] and domain discriminator d_t, we finally obtain Lemma 1.
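In practice the final supremum above is estimated on mini-batches by weighting each source sample's critic score with ᾱ_t(z) = α_t(y). A minimal numpy sketch follows; the linear `critic` is a toy stand-in for the 1-Lipschitz discriminator d_t:

```python
import numpy as np

def weighted_critic_objective(critic, z_src, y_src, z_tgt, alpha):
    # Empirical estimate of E_{z~S_t} alpha_bar(z) d(z) - E_{z~T} d(z),
    # where alpha_bar(z) = alpha[y] for a source pair (z, y).
    src_scores = alpha[y_src] * critic(z_src)   # label-weighted source critic scores
    tgt_scores = critic(z_tgt)
    return float(src_scores.mean() - tgt_scores.mean())

critic = lambda z: z.sum(axis=1)                # toy linear stand-in for d_t
z_src = np.array([[1.0, 0.0], [0.0, 1.0]])
y_src = np.array([0, 1])
z_tgt = np.array([[0.5, 0.5]])
alpha = np.array([2.0, 0.5])                    # label ratio alpha_t(y) = T(y) / S_t(y)
obj = weighted_critic_objective(critic, z_src, y_src, z_tgt, alpha)
# obj = mean([2.0, 0.5]) - 1.0 = 0.25
```

The critic maximizes this objective while the feature learner minimizes it, mirroring the adversarial form of Lemma 1.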

I DERIVATION OF THE LABEL RATIO LOSS

We suppose the representation learning matches the conditional distributions, i.e., T(z|y) ≈ S_t(z|y), ∀t, and denote by T̃(y) the predicted target label distribution. Simplifying the notation, define f(z) = argmax_y h(z, y), the most probable predicted label. Then we have:

T̃(y) = Σ^{|Y|}_{k=1} T(f(z) = y | Y = k) T(Y = k) = Σ^{|Y|}_{k=1} S_t(f(z) = y | Y = k) T(Y = k) = Σ^{|Y|}_{k=1} S_t(f(z) = y, Y = k) α_t(k) := T̃_{α_t}(y)

The first equality comes from the definition of the predicted target label distribution: T̃(y) = E_{T(z)} 1{f(z) = y} = T(f(z) = y) = Σ^{|Y|}_{k=1} T(f(z) = y, Y = k) = Σ^{|Y|}_{k=1} T(f(z) = y | Y = k) T(Y = k). The second equality, T(f(z) = y | Y = k) = S_t(f(z) = y | Y = k), holds since T(z|y) ≈ S_t(z|y) for all t, so the shared hypothesis f satisfies T(f(z) = y | Y = k) = S_t(f(z) = y | Y = k). The term S_t(f(z) = y, Y = k) is the (expected) source prediction confusion matrix, and we denote its empirical (observed) version by Ŝ_t(f(z) = y, Y = k).

Based on this idea, in practice we seek an α̂_t that matches the two predicted distributions T̃ and T̃_{α̂_t}. Adopting the KL divergence as the metric, we have:

min_{α̂_t} D_KL(T̃ ‖ T̃_{α̂_t}) = min_{α̂_t} E_{y∼T̃} log(T̃(y)/T̃_{α̂_t}(y)) = min_{α̂_t} −E_{y∼T̃} log T̃_{α̂_t}(y) = min_{α̂_t} −Σ_y T̃(y) log(Σ^{|Y|}_{k=1} S_t(f(z) = y, Y = k) α̂_t(k))

Note the natural constraints on the label ratio: α̂_t(y) ≥ 0 and Σ_y α̂_t(y) Ŝ_t(y) = 1. Based on this principle, we propose the optimization problem for estimating each label ratio, adopting the empirical counterpart, i.e., the empirical confusion matrix C_{Ŝ_t}. The key difference between multi-source conventional and partial unsupervised DA is the estimation step for α̂_t: we only add a sparsity constraint when estimating each α̂_t:

min_{α̂_t} −Σ^{|Y|}_{y=1} T̃(y) log(Σ^{|Y|}_{k=1} C_{Ŝ_t}[y, k] α̂_t(k)) + C₂ ‖α̂_t‖₁  s.t. ∀y ∈ Y, α̂_t(y) ≥ 0, Σ_y α̂_t(y) Ŝ_t(y) = 1

where C₂ is the hyper-parameter controlling the level of target label sparsity when estimating the target label distribution. In the paper we set C₂ = 0.1.
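For intuition: when C₂ = 0 and the predicted target distribution is exactly realizable, i.e., T̃ = C_{Ŝ_t} α̂_t for some feasible α̂_t, the KL objective attains its minimum of zero, so α̂_t can be recovered by solving a linear system. The sketch below illustrates this special case only; it is not the paper's CVXPY-based solver, and the clip/renormalize steps are a crude projection onto the constraint set:

```python
import numpy as np

def estimate_label_ratio(conf, t_hat, s_hat):
    # conf[y, k] ~ S_t(f(z) = y, Y = k): source prediction confusion matrix.
    # t_hat[y]: predicted target label distribution; s_hat[k]: source label marginal.
    alpha = np.linalg.solve(conf, t_hat)        # match t_hat = conf @ alpha
    alpha = np.clip(alpha, 0.0, None)           # enforce alpha >= 0
    return alpha / (alpha @ s_hat)              # enforce sum_y alpha(y) * s_hat(y) = 1

# Toy example with a diagonally dominant confusion matrix (a mostly-correct classifier).
conf = np.array([[0.45, 0.05],
                 [0.05, 0.45]])                 # rows: predicted y; columns: true k
s_hat = conf.sum(axis=0)                        # source marginal S_t(y) = (0.5, 0.5)
alpha_true = np.array([1.6, 0.4])               # target skewed toward class 0
t_hat = conf @ alpha_true                       # induced predicted target distribution
alpha = estimate_label_ratio(conf, t_hat, s_hat)
# alpha recovers (1.6, 0.4)
```

With noisy estimates or the sparsity term C₂ > 0, the constrained convex program of this section is needed instead of a plain linear solve.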

K EXPLICIT AND IMPLICIT CONDITIONAL LEARNING

Inspired by Theorem 2, we need to learn the functions g: X → Z and h: Z × Y → R by minimizing:

min_{g,h} Σ_t λ[t] R̂^{α̂_t}_{S_t}(h, g) + C₀ Σ_t λ[t] E_{y∼T̃(y)} W₁(Ŝ_t(z|Y = y) ‖ T̂(z|Y = y))

For ε ∈ [0, 1], this can be equivalently expressed as:

min_{g,h} Σ_t λ[t] R̂^{α̂_t}_{S_t}(h, g) + ε C₀ Σ_t λ[t] E_{y∼T̃(y)} W₁(Ŝ_t(z|Y = y) ‖ T̂(z|Y = y)) + (1 − ε) C₀ Σ_t λ[t] E_{y∼T̃(y)} W₁(Ŝ_t(z|Y = y) ‖ T̂(z|Y = y))

Using the explicit and implicit approximations of the conditional distance, we then optimize an alternative form:

min_{g,h} max_{d₁,…,d_T}  Σ_t λ[t] R̂^{α̂_t}_{S_t}(h, g)  [classification loss]
+ ε C₀ Σ_t λ[t] E_{y∼T̃(y)} ‖C^y_t − C^y‖₂  [explicit conditional loss]
+ (1 − ε) C₀ Σ_t λ[t] [E_{z∼Ŝ_t(z)} ᾱ̂_t(z) d_t(z) − E_{z∼T̂(z)} d_t(z)]  [implicit conditional loss]

Based on this equivalent form, our approach provides a theoretically principled way of tuning the weights. In the paper we set ε = 0.5.

• T̃(y): empirical target label distribution. (In the unsupervised DA scenarios, we approximate it by the predicted target label distribution.)

Gradient penalty. To enforce the Lipschitz property of the statistic critic functions, we adopt the gradient penalty term (Gulrajani et al., 2017). More concretely, given two samples z_s ∼ Ŝ_t(z) and z_t ∼ T̂(z), we generate an interpolated sample z_int = ξ z_s + (1 − ξ) z_t with ξ ∼ Unif[0, 1]. We then add the gradient penalty ‖∇d(z_int)‖²₂ as a regularization term to control the Lipschitz property of the discriminators d₁, …, d_T.
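The interpolation and penalty steps above can be sketched numerically. In the toy example below the critic is linear, so its gradient is constant and available in closed form; in practice the gradient comes from automatic differentiation, and the helper names are illustrative:

```python
import numpy as np

def gradient_penalty(critic_grad, z_src, z_tgt, rng):
    # Interpolate source/target features with xi ~ Unif[0, 1] and penalize the
    # squared gradient norm of the critic at the interpolated points
    # (Gulrajani et al., 2017).
    xi = rng.uniform(size=(z_src.shape[0], 1))
    z_int = xi * z_src + (1.0 - xi) * z_tgt
    grads = critic_grad(z_int)                  # (batch, dim) critic gradients
    return float((np.linalg.norm(grads, axis=1) ** 2).mean())

# Toy critic d(z) = w . z has constant gradient w, so the penalty equals ||w||^2.
w = np.array([0.6, 0.8])
critic_grad = lambda z: np.tile(w, (z.shape[0], 1))
rng = np.random.default_rng(0)
z_src = rng.normal(size=(4, 2))
z_tgt = rng.normal(size=(4, 2))
gp = gradient_penalty(critic_grad, z_src, z_tgt, rng)
# gp == ||w||^2 = 1.0 for this linear critic, regardless of the interpolation points
```

Because the penalty is evaluated on random interpolates between source and target features, it softly constrains the critic's Lipschitz constant along the paths connecting the two distributions.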

L ALGORITHM DESCRIPTIONS

We detail the pipeline of the proposed algorithm in Algorithms 2 and 3. To update λ and α̂_t, we iteratively solve the convex optimization problems after each training epoch and update the parameters with a moving-average technique. We noticed that updating these two parameters at the mini-batch level leads to unstable training. Consequently, we accumulate the confusion matrix, the weighted prediction risk, and the conditional Wasserstein distance over the whole training epoch and then solve the optimization problems. We use CVXPY to solve the two standard convex losses. A comparison of time and memory complexity is discussed below.

Moving average for updating the source/target class centroids (we set ε₁ = 0.7):
13: Solve α̂_t by Eq. (2) (or Eq. (5) in the partial scenario).
14: Update α̂_t by moving average: α̂_t = ε₁ × α̂_t + (1 − ε₁) × α̂′_t.
15: Compute the weighted loss and weighted centroid distance, then solve λ (denoted λ′) as in Sec. 2.3.
5: Source class centroid update: C^y_t = ε₁ × C^y_t + (1 − ε₁) × C′^y_t.
6: Target class centroid update: C^y = ε₁ × C^y + (1 − ε₁) × C′^y.
7: Update g, h, d₁, …, d_T (SGD with gradient reversal), based on Eq. (6).

We use the Amazon review dataset (Blitzer et al., 2007). It contains four domains (Books, DVD, Electronics, and Kitchen) with positive (label "1") and negative (label "0") product reviews. The dataset sizes are 6465 (Books), 5586 (DVD), 7681 (Electronics), and 7945 (Kitchen). We follow the common data pre-processing strategy of Chen et al. (2012): we use bag-of-words (BOW) features and extract the top 5000 most frequent unigrams and bigrams of all the reviews. We also note that the original dataset is label-balanced, i.e., D(y = 0) = D(y = 1).
To highlight the benefits of the proposed approach, we create a new dataset with label-distribution drift. Specifically, in the experimental settings we randomly drop 50% of the data with label "0" (negative reviews) for all source domains while keeping the target identical, as shown in Fig. (5). We observe that on the tasks Books, DVD, Electronics, and Kitchen, the results are significantly better under a large label shift (see the per-task results in Fig. (12)). In the initialization with almost no label shift, the state-of-the-art DARN is slightly (< 1%) better.

O.2 ADDITIONAL ANALYSIS ON AMAZON DATASET

We present two additional results to illustrate the working principles of WADN, shown in Fig. (13) and Fig. (14). We visualize the evolution of λ for DARN and WADN, both of which use a theoretically principled approach to estimate λ. We observe that on the source-shifted data, DARN gives an inconsistent estimate of λ (Fig. (13)), which differs from the observation of Wen et al. (2020). We conjecture that under conditional and label distribution shift, using R̂_S(h(z)) + Discrepancy(S(z), T(z)) to update λ is unstable. In contrast, WADN gives a relatively consistent estimate of λ under the source-shifted data. In addition, WADN gradually and correctly detects the unbalanced source data and assigns a higher weight α̂_t to label y = 0 (first row of Fig. (14)). These principles jointly promote the significantly better results of WADN.

Since we drop digits 5-9 in the source domains, α̂_t(y) for y ∈ [5, 9] is assigned a relatively higher value. Besides, when fewer classes are selected, the accuracy of DANN, PADA, and DARN does not drop drastically but remains relatively stable. We see the following possible reasons:
• The reported performances are averaged over different selections of sub-classes rather than a single selection. From the statistical perspective, if we take a close look at the variance, the results of DANN are much more unstable (higher std) across the different samplings. Therefore, conventional domain adversarial training is improper for handling partial transfer: it is unreliable, and negative transfer still occurs.
• In multi-source DA, it is equally important to detect the non-overlapping classes and to find the most similar sources. Comparing with the baselines that focus on only one or two principles shows the importance of unified principles in multi-source partial DA.
• We also observe that on the Real-World dataset, DANN improves the performance by a relatively large margin.
This is due to the inherent difficulty of the learning task itself: the Real-World domain shows much higher performance than the other domains. According to the Fano lower bound, a task with fewer classes is generally easier to learn, so it is possible that the vanilla approach shows improvement, but still with a much higher variance. Fig. (17) and Fig. (18) show the estimated α̂_t for different numbers of selected classes. The results validate the correctness of WADN in estimating the label distribution ratio.



Since supp(T(y)) ⊆ supp(S_t(y)), we naturally have T(y) ≪ S_t(y), so the ratio α_t is well-defined. If the hypothesis is a neural network, the Rademacher complexity can still be bounded analogously. In the label-shift scenarios, the mini-batches are highly label-imbalanced; if we evaluated α̂_t over a mini-batch, the estimate would be computationally expensive and unstable. The optimization problems w.r.t. α̂_t and λ are not large-scale, so using a standard convex solver is fast and accurate.




Figure 1: (a) Unsupervised DA with Amazon Review dataset. Accuracy under different levels of shifted sources (higher dropping rate means larger label shift). The results are averaged on all target domains. See the results for each task in Fig. (12). (b) Visualization of λ in unsupervised DA, each row corresponds to one target domain. (c) Transfer learning with limited target labels in USPS. The performance of WADN is consistently better under different target samples (smaller portion indicates fewer target samples).

Figure 2: Analysis on Partial DA of target Product. (a) Performance (mean ± std) of different selected classes on the target; (b) We select 15 classes and visualize estimated αt (the bar plot). The "X" along the x-axis represents the index of dropped 50 classes. The red curves are the true label distribution ratio. See Appendix P for additional results and analysis.

Here N = Σ^T_{t=1} N_{S_t} is the total number of source observations and β_t = N_{S_t}/N is the frequency ratio of each source. Moreover, d^sup_∞ = max_{t=1,…,T} d_∞(T(y) ‖ S_t(y)) = max_{t=1,…,T} max_{y∈[1,|Y|]} α_t(y) is the maximum true label-shift value (a constant).

Here C^t_max = sup_y W₁(T̂(x|Y = y) ‖ Ŝ_t(x|Y = y)) is a positive and bounded constant. The remaining quantity Σ_y |T(y) − T̂(y)| is then bounded by adopting McDiarmid's inequality, with high probability 1 − δ.

Figure 4: Network structure of the proposed approach. It consists of three losses: the weighted classification loss, the centroid matching loss for explicit conditional matching, and the weighted adversarial loss for implicit conditional matching, as shown in Eq. (6).

• C^y_t = (1/N^y_{S_t}) Σ_{(z_t,y_t)∼Ŝ_t} 1{y_t = y} z_t is the centroid of label Y = y in source S_t.
• C^y = (1/N^y_T) Σ_{(z_t,y_p)∼T̂} 1{y_p = y} z_t is the centroid of pseudo-label y_p = y in the target (in the unsupervised DA scenarios).
• ᾱ̂_t(z) = 1{(z,y)∼S_t} α̂_t(Y = y); that is, for each pair (z, y) observed from the distribution, ᾱ̂_t(Z = z) = α̂_t(Y = y).
• d₁, …, d_T are the domain discriminators (critic functions), restricted to 1-Lipschitz functions.
• ε ∈ [0, 1] is the adjustment parameter in the trade-off between explicit and implicit learning.
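The centroid bookkeeping above, combined with the moving-average update used in Algorithms 2 and 3, can be sketched as follows. This is a minimal illustration: the momentum 0.7 mirrors ε₁, and classes absent from a mini-batch simply keep their previous centroid.

```python
import numpy as np

def update_centroids(centroids, feats, labels, n_classes, momentum=0.7):
    # Exponential moving average of the per-class feature means (class centroids).
    # centroids: (n_classes, dim); feats: (batch, dim); labels: (batch,)
    new_centroids = centroids.copy()
    for y in range(n_classes):
        mask = labels == y
        if mask.any():                  # classes absent from the batch keep their centroid
            batch_mean = feats[mask].mean(axis=0)
            new_centroids[y] = momentum * centroids[y] + (1 - momentum) * batch_mean
    return new_centroids

rng = np.random.default_rng(0)
C = np.zeros((3, 4))                    # 3 classes, 4-dim latent features
feats = rng.normal(size=(8, 4))
labels = np.array([0, 0, 1, 1, 1, 2, 2, 0])
C = update_centroids(C, feats, labels, n_classes=3)
```

The same update is applied to the target centroids, with the predicted pseudo-labels playing the role of `labels` in the unsupervised scenarios.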

Time complexity: for each mini-batch we need to compute T re-weighted classification losses, T domain adversarial losses, and T explicit conditional losses, so the computational complexity is O(T) during mini-batch training, comparable to recent state-of-the-art methods such as MDAN and DARN. In addition, after each training epoch we estimate α̂_t and λ, which takes O(T|Y|) time per epoch (if we adopt SGD to solve the two convex problems). Therefore, our proposed algorithm has time complexity O(T|Y|); the extra |Y| factor comes from handling label shift in the designed algorithm.

Memory complexity: our proposed approach requires O(T) domain discriminators and O(T|Y|) class-feature centroids. By contrast, MDAN and DARN require O(T) domain discriminators, and M3SDA and MDMN require O(T²) domain discriminators. Since our class-feature centroids are defined in the latent space (z), the memory footprint of the class-feature centroids can be much smaller than that of domain discriminators.

Algorithm 2: Wasserstein Aggregation Domain Network (unsupervised scenarios, one iteration)
Require: Labeled source samples Ŝ₁, …, Ŝ_T; target samples T̂.
Ensure: Label distribution ratio α̂_t and task relation simplex λ; feature learner g; classifier h; statistic critic functions d₁, …, d_T; class centroids C^y_t (source) and C^y (target), ∀t ∈ [1, T], y ∈ Y.

1: DNN parameter training stage (fixed α̂_t and λ)
2: for mini-batches of samples (x_{S₁}, y_{S₁}) ∼ Ŝ₁, …, (x_{S_T}, y_{S_T}) ∼ Ŝ_T, (x_T) ∼ T̂ do
3:   Predict the target pseudo-label ȳ_T = argmax_y h(g(x_T), y)
4:   Compute the (un-normalized) source confusion matrix for each batch: C_{Ŝ_t} = #[argmax_{y′} h(z, y′) = y, Y = k] (t = 1, …, T)
5:   Compute the batched class centroids for source, C′^y_t, and target, C′^y.
6:

16: Update λ by moving average: λ = 0.8 × λ + 0.2 × λ′

Algorithm 3: Wasserstein Aggregation Domain Network (limited target data, one iteration)
Require: Labeled source samples Ŝ₁, …, Ŝ_T; target samples T̂; label shift ratios α_t.
Ensure: Task relation simplex λ; feature learner g; classifier h; statistic critic functions d₁, …, d_T; class centroids C^y_t (source) and C^y (target), ∀t ∈ [1, T], y ∈ Y.
1: DNN parameter training stage (fixed λ)
2: for mini-batches of samples (x_{S₁}, y_{S₁}) ∼ Ŝ₁, …, (x_{S_T}, y_{S_T}) ∼ Ŝ_T, (x_T) ∼ T̂ do
3:   Compute the batched class centroids for source, C′^y_t, and target, C′^y.
4:

10: Solve λ as in Sec. 2.3 (denoted λ′)
11: Update λ by moving average: λ = ε₁ × λ + (1 − ε₁) × λ′

M DATASET DESCRIPTION AND EXPERIMENTAL DETAILS

M.1 AMAZON REVIEW DATASET

Figure 10: Neural Network Structure in the Office-Home

Figure 14: Amazon dataset. WADN approach: evolution of α̂_t during training. Darker indicates a higher value. Since we drop y = 0 in the sources, the true α_t(0) > 1 and is assigned a higher value.

Practical principles under different scenarios

Unsupervised DA: Accuracy (%) on the Source-Shifted Digits.

           ±2.24        69.13 ±1.56  79.77 ±1.69  86.50 ±1.59  80.81
MDMN       87.31 ±1.88  69.84 ±1.59  80.27 ±0.88  86.61 ±1.41  81.00
M3SDA      87.22 ±1.70  68.89 ±1.93  80.01 ±1.77  86.39 ±1.68  80.87
DARN       86.98 ±1.29  68.59 ±1.79  80.68 ±0.61  86.85 ±1.78  80.78
WADN       89.07 ±0.72  71.66 ±0.77  82.06 ±0.89  90.07 ±1.10  83.22
Full TAR   98.70 ±0.15  85.20 ±0.09  95.10 ±0.14  96.64 ±0.13
Full TAR   76.17 ±0.16  79.37 ±0.22  90.60 ±0.24  87.65 ±0.18

Multi-source Transfer: Accuracy (%) on Source-Shifted Amazon Review.
Tar   72.59 ±1.89  73.02 ±1.84  81.59 ±1.58  77.03 ±1.73

Multi-source Transfer: Accuracy (%) on the Source-Shifted Digits

Unsupervised Partial DA: Accuracy (%) on Office-Home (#Source: 65, #Target: 35).

           ±1.42        49.79 ±1.14  68.10 ±1.33  78.24 ±0.76  61.67
DANN       53.86 ±2.23  52.71 ±2.20  71.25 ±2.44  76.92 ±1.21  63.69
MDAN       67.56 ±1.39  65.38 ±1.30  81.49 ±1.92  83.44 ±1.01  74.47
MDMN       68.13 ±1.08  65.27 ±1.93  81.33 ±1.29  84.00 ±0.64  74.68
M3SDA      65.10 ±1.97  61.80 ±1.99  76.19 ±2.44  79.14 ±1.51  70.56
DARN       71.53 ±0.63  69.31 ±1.08  82.87 ±1.56  84.76 ±0.57  77.12
PADA       74.37 ±0.84  69.64 ±0.80  83.45 ±1.13  85.64 ±0.39  78.28
WADN       80.06 ±0.93  75.90 ±1.06  89.55 ±0.72  90.40 ±0.39

Unsupervised DA: Accuracy (%) on Source-Shifted Amazon Review.

           ±2.31        67.81 ±2.46  80.96 ±0.77  75.67 ±1.96  73.30
MDMN       70.56 ±1.05  69.64 ±0.73  82.71 ±0.71  77.05 ±0.78  74.99
M3SDA      69.09 ±1.26  68.67 ±1.37  81.34 ±0.66  76.10 ±1.47  73.79
DARN       71.21 ±1.16  68.68 ±1.12  81.51 ±0.81  77.71 ±1.09  74.78
WADN       73.72 ±0.63  79.64 ±0.34  84.64 ±0.48  83.73 ±0.50  80.43
Full TAR   84.10 ±0.13  83.68 ±0.12  86.11 ±0.32  88.72 ±0.14

Estimation of α̂_t and λ:
12: Compute the global (normalized) source confusion matrix C_{Ŝ_t} = Ŝ_t[argmax_{y′} h(z, y′) = y, Y = k] (t = 1, …, T)
13: Solve α̂_t (denoted {α′_t}^T_{t=1}) by Equation (2) (or Eq. (5) in the partial scenario)

2. Task prediction: 3 fully connected layers.
   layer1: fc [*, 256], batch normalization, act fn: Leaky ReLU
   layer2: fc [256, 256], batch normalization, act fn: Leaky ReLU
   layer3: fc [256, 65]
3. Domain discriminator: 3 fully connected layers, preceded by reverse_gradient().
   layer1: fc [*, 256], batch normalization, act fn: Leaky ReLU
   layer2: fc [256, 256], batch normalization, act fn: Leaky ReLU
   layer3: fc [256, 1], Sigmoid


In our paper, we assume the target data are i.i.d. sampled from D(x, y). Equivalently, we first sample y ∼ D(y) i.i.d. and then sample x ∼ D(x|y) i.i.d. In general D(y) is non-uniform, so few-shot settings are generally not applicable under our theoretical assumptions.

Multi-task learning. The goal of multi-task learning (Zhang & Yang, 2017) is to improve the prediction performance of all tasks, whereas we aim at controlling the prediction risk of a specified target domain. We also note that some practical techniques are shared, such as shared parameters (Zhang & Yeung, 2012) and shared representations (Ruder, 2017).

D TABLE OF NOTATION

• Empirical risk: computed on observed data {(x_i, y_i)}^N_{i=1} i.i.d. sampled from D.
• α and α̂_t: true and empirical label distribution ratio, α(y) = T(y)/S(y).
• Conditional distribution w.r.t. the latent variable Z induced by the feature learning function g.
• W₁(S_t(z|y) ‖ T(z|y)): conditional Wasserstein distance on the latent space Z.

E PROOF OF THEOREM 1

Proof idea. The proof of Theorem 1 consists of three steps, starting from the following lemma.

Lemma 2. If the prediction loss ℓ is L-Lipschitz and the hypothesis h is K-Lipschitz w.r.t. the feature x (given the same label), i.e., for ∀Y = y, ‖h(x₁, y) − h(x₂, y)‖₂ ≤ K‖x₁ − x₂‖₂, then the target risk can be upper bounded by:

R_T(h) ≤ R^α_S(h) + LK E_{y∼T(y)} W₁(T(x|Y = y) ‖ S(x|Y = y))

Proof. See the derivation preceding Appendix E.1.

We choose the MLP model with:
• feature representation function g: [5000, 1000] units;
• task prediction and domain discriminator functions: [1000, 500, 100] units.
We choose a dropout rate of 0.7 in the hidden and input layers. The hyper-parameters are chosen by cross-validation. The neural network is trained for 50 epochs with a mini-batch size of 20 per domain. The optimizer is Adadelta with a learning rate of 0.5.

Experimental setting. We use the Amazon review dataset for two transfer learning scenarios (limited target labels and unsupervised DA). We first randomly select 2K samples for each domain, then create a drifted distribution for each source, leaving each source with ≈1500 samples while the target keeps its 2K samples. In unsupervised DA, we use the labeled source tasks and the unlabeled target task and aim to predict the labels of the target domain. In conventional transfer learning, we randomly sample only 10% of the target data (≈200 samples) as the target training set and use the remaining 90% as the target test set. We select C₀ = 0.01 and C₁ = 1 for these two transfer scenarios, and in both settings we set the maximum number of training epochs to 50.

We also visualize the label distributions of the four digits datasets. The original datasets show almost uniform label distributions on MNIST as well as Synth (Fig. 7(a)). In our paper, we generate a label-distribution drift on the source datasets for each multi-source transfer task: concretely, we drop 50% of the data on digits 5-9 for all sources while keeping the target label distribution unchanged. (Fig. 7(b) illustrates one example with sources MNIST, USPS, SVHN and target Synth; we drop samples of these digits only in the sources.)


MNIST and USPS images are resized to 32 × 32 and represented as 3-channel color images to match the shape of the other datasets. Each domain has its own training and test sets as downloaded; the respective training sample sizes are 60000, 7219, 73257, and 479400, and the respective test sample sizes are 10000, 2017, 26032, and 9553. The model structure is shown in Fig. 6. There is no dropout, and the hyper-parameters are chosen by cross-validation. The model is trained for 60 epochs with a mini-batch size of 128 per domain. The optimizer is Adadelta with a learning rate of 1.0. We adopt γ = 0.5 for MDAN and γ = 0.1 for DARN in the baselines (Wen et al., 2020).

Experimental setting. We use the digits datasets for two transfer learning scenarios (limited target labels and unsupervised DA). Note that USPS has only 7219 samples and the digits tasks are relatively simple. We first randomly select 7K samples for each domain and create a drifted distribution for each source, leaving each source with ≈5300 samples while the target keeps its 7K samples. In unsupervised DA, we use the labeled source tasks and the unlabeled target task and aim to predict the labels of the target domain. In transfer learning with limited target data, we randomly sample only 10% of the target data (≈700 samples) as the target training set and use the remaining 90% as the target test set. We select C₀ = 0.01 and C₁ as the maximum prediction loss, C₁ = max_t R̂^{α̂_t}(h), as the hyper-parameters across these two scenarios. We randomly drop 50% of the data on digits 5-9 in all sources while keeping the target label distribution unchanged.
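The image preprocessing step above can be sketched as follows. This is a minimal illustration that assumes zero-padding from 28 × 28 to 32 × 32 and channel replication; an actual pipeline might instead use interpolation-based resizing:

```python
import numpy as np

def mnist_to_rgb32(images):
    # images: (n, 28, 28) grayscale digits.
    # Zero-pad 28x28 -> 32x32, then replicate the single channel 3 times.
    padded = np.pad(images, ((0, 0), (2, 2), (2, 2)), mode="constant")
    return np.repeat(padded[:, :, :, None], 3, axis=3)   # -> (n, 32, 32, 3)

batch = np.ones((5, 28, 28))
rgb = mnist_to_rgb32(batch)
# rgb has shape (5, 32, 32, 3); interior pixels keep their value in all 3 channels
```

Replicating the grayscale channel keeps the digit content unchanged while making the tensor shape compatible with the color datasets (SVHN, Synth).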

M.3 OFFICE-HOME DATASET

To evaluate the method in more complex scenarios, we use the challenging Office-Home dataset (Venkateswara et al., 2017). It contains images of 65 object categories, such as spoon, sink, mug, and pen, from four different domains: Art (paintings, sketches, and/or artistic depictions), Clipart (clipart images), Product (images without background), and Real-World (regular images captured with a camera). One of the four domains is chosen as the unlabeled target and the other three are used as labeled source domains. The dataset sizes are 2427 (Art), 4365 (Clipart), 4439 (Product), and 4357 (Real-World). We follow the same training/test procedure as (Wen et al., 2020). We additionally visualize the label distribution D(y) of the four domains in Fig. 9, which illustrates their inherently different label distributions. We did not re-sample the source label distributions to the uniform distribution in the data pre-processing step; all baselines are evaluated under the same setting. We use a ResNet50 (He et al., 2016) pretrained on ImageNet in PyTorch as the base network for feature learning, followed by an MLP with the network structure shown in Fig. 10.

Experimental settings. We use the original Office-Home dataset for two transfer learning scenarios (unsupervised DA and label-partial unsupervised DA). We use the SGD optimizer with learning rate 0.005, momentum 0.9, and weight decay 1e-3. The model is trained for 100 epochs with a mini-batch size of 32 per domain. As for the baselines, MDAN uses γ = 1.0 while DARN uses γ = 0.5. We select C₀ = 0.01 and C₁ as the maximum prediction loss, C₁ = max_t R̂^{α̂_t}(h), as the hyper-parameters across these two scenarios. In multi-source unsupervised partial DA, we randomly select 35 classes from the target (repeating the sampling 3 times) and run 5 trials for each sampling; the final result is based on these 3 × 5 = 15 repetitions.

