UNIFIED PRINCIPLES FOR MULTI-SOURCE TRANSFER LEARNING UNDER LABEL SHIFTS

Abstract

We study the label shift problem in multi-source transfer learning and derive new generic principles. Our proposed framework unifies the principles of conditional feature alignment, label distribution ratio estimation, and domain relation weight estimation. Guided by these principles, we provide a unified practical framework for three multi-source label-shift transfer scenarios: learning with limited target data, unsupervised domain adaptation, and partial unsupervised domain adaptation. We evaluate the proposed method on these scenarios through extensive experiments and show that our algorithm can significantly outperform the baselines.

1. INTRODUCTION

Transfer learning (Pan & Yang, 2009) is based on the motivation that learning a new task is easier after having learned several similar tasks. By learning the inductive bias from a set of related source domains (S_1, . . . , S_T) and then leveraging the shared knowledge upon learning the target domain T, the prediction performance can be significantly improved. Based on this, transfer learning arises in deep learning applications such as computer vision (Zhang et al., 2019; Tan et al., 2018; Hoffman et al., 2018b), natural language processing (Ruder et al., 2019; Houlsby et al., 2019) and biomedical engineering (Raghu et al., 2019; Lundervold & Lundervold, 2019; Zhang & An, 2017). To ensure a reliable transfer, it is critical to understand the theoretical assumptions relating the domains. One implicit assumption in most transfer learning algorithms is that the label proportions remain unchanged across different domains (Du Plessis & Sugiyama, 2014) (i.e., S(y) = T(y)). However, in many real-world applications, the label distributions can vary markedly (i.e., label shift) (Wen et al., 2014; Lipton et al., 2018; Li et al., 2019b), in which case existing approaches cannot guarantee a small target generalization error, as recently proved by Combes et al. (2020). Moreover, transfer learning becomes more challenging when transferring knowledge from multiple sources to build a model for the target domain, as this requires effectively selecting and leveraging the most useful source domains when label shift occurs. This is not only theoretically interesting but also commonly encountered in real-world applications. For example, in medical diagnostics, the disease distribution varies across countries (Liu et al., 2004; Geiss et al., 2014). Considering the task of diagnosing a disease in a country without sufficient data, how can we leverage the information from different countries with abundant data to help with the diagnosis?
Obviously, naïvely combining all the sources and applying a one-to-one single-source transfer learning algorithm can lead to undesired results, as it can include low-quality or even untrusted data from certain sources, which can severely degrade the performance. In this paper, we study the label shift problem in multi-source transfer learning, where S_t(y) ≠ T(y). We propose unified principles that are applicable to three common transfer scenarios: unsupervised Domain Adaptation (DA) (Ben-David et al., 2010), limited target labels (Mansour et al., 2020), and partial unsupervised DA with supp(T(y)) ⊆ supp(S_t(y)) (Cao et al., 2018), whereas prior works generally treated them as separate scenarios. It should be noted that this work deals with target shift without assuming that the semantic conditional distributions are identical (i.e., without S_t(x|y) = T(x|y)), which is more realistic for real-world problems. Our contributions in this paper are two-fold: (I) We propose to use the Wasserstein distance (Arjovsky et al., 2017) to develop a new target generalization risk upper bound (Theorem 1), which reveals the importance of label distribution ratio estimation and provides a principled guideline for learning the domain relation coefficients. Moreover, we provide a theoretical analysis in the context of representation learning (Theorem 2), which guides learning a feature function that minimizes the conditional Wasserstein distance while controlling the weighted source risk. We further reveal that the relations among the aforementioned three scenarios lie in the different assumptions used for estimating the label distribution ratio. (II) Inspired by the theoretical results, we propose the Wasserstein Aggregation Domain Network (WADN) for handling label shift in multi-source transfer learning. We evaluate our algorithm on three benchmark datasets, and the results show that it can significantly outperform state-of-the-art principled approaches.
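The two distributional conditions above can be made concrete with a small numerical sketch. The class priors below are purely illustrative (not from the paper's datasets); the snippet checks, per source, the label-shift condition S_t(y) ≠ T(y) and the partial-DA support condition supp(T(y)) ⊆ supp(S_t(y)).

```python
import numpy as np

# Hypothetical label marginals over 4 classes for three source domains
# and one target domain; all names and values are illustrative only.
S = {
    "S1": np.array([0.40, 0.30, 0.20, 0.10]),
    "S2": np.array([0.10, 0.20, 0.30, 0.40]),
    "S3": np.array([0.25, 0.25, 0.25, 0.25]),
}
# Target contains only classes 0 and 1 (the partial-DA situation).
T = np.array([0.50, 0.50, 0.00, 0.00])

supp_T = set(np.nonzero(T)[0])
for name, S_y in S.items():
    has_label_shift = not np.allclose(S_y, T)          # S_t(y) != T(y)
    supp_ok = supp_T <= set(np.nonzero(S_y)[0])        # supp(T(y)) ⊆ supp(S_t(y))
    print(name, "label shift:", has_label_shift, "| support condition:", supp_ok)
```

Here every source exhibits label shift relative to T, yet all satisfy the support condition required by the partial unsupervised DA scenario.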

2. RELATED WORK

Multi-Source Transfer Learning Theories have been investigated in the previous literature with different principles for aggregating source domains. In the popular unsupervised DA setting, (Zhao et al., 2018; Peng et al., 2019; Wen et al., 2020; Li et al., 2018b) adopted the H-divergence (Ben-David et al., 2007), discrepancy (Mansour et al., 2009) and Wasserstein distance (Arjovsky et al., 2017) of the marginal distributions d(S_t(x), T(x)) to estimate domain relations and dynamically leverage different domains. The resulting bounds generally consist of the source risk, a domain discrepancy, and an unobservable term η, the optimal risk on all the domains, which is ignored in these approaches. However, as Combes et al. (2020) pointed out, ignoring the influence of η is problematic when the label distributions between source and target domains are significantly different. It is therefore necessary to take η into consideration when a small amount of labelled data is available for the target domain (Wen et al., 2020). Following this line, very recent works (Konstantinov & Lampert, 2019; Wang et al., 2019a; Mansour et al., 2020) began to measure the divergence between two domains given label information for the target domain by using the Y-discrepancy (Mohri & Medina, 2012). However, we empirically show that these methods are still unable to handle label shift. Label-Shift Label shift (Zhang et al., 2013; Gong et al., 2016) is a common phenomenon in transfer learning, with S(y) ≠ T(y), and is generally ignored by previous multi-source transfer learning practice. Several theoretically principled approaches have been proposed, such as (Azizzadenesheli et al., 2019; Garg et al., 2020). In addition, (Combes et al., 2020; Wu et al., 2019) analyzed the generalized label shift problem in the one-to-one single-source unsupervised DA problem but did not provide guidelines for leveraging different sources to ensure a reliable transfer, which is more challenging.
(Redko et al., 2019) proposed an optimal transport strategy for multi-source unsupervised DA under label shift by assuming identical semantic conditional distributions. However, they did not consider representation learning in conjunction with their framework and did not design neural-network-based approaches. Different from these, we analyze our problem in the context of representation learning and propose efficient and principled strategies. Moreover, our theoretical results highlight the importance of the label shift problem in a variety of multi-source transfer problems, while the aforementioned works generally focus on the unsupervised DA problem without considering unified rules for different scenarios (e.g., partial multi-source DA).

3. THEORETICAL INSIGHTS: TRANSFER RISK UPPER BOUND

We assume a scoring hypothesis h : X × Y → R, defined on the input space X and output space Y, that is K-Lipschitz w.r.t. the feature x (given the same label), i.e., for all y, |h(x_1, y) - h(x_2, y)| ≤ K ||x_1 - x_2||_2, and a loss function ℓ : R × R → R_+ that is positive, L-Lipschitz, and upper bounded by L_max. We denote the expected risk w.r.t. distribution D as R_D(h) = E_{(x,y)~D} ℓ(h(x, y)) and its empirical counterpart (w.r.t. the sample D̂) as R̂_D(h) = (1/|D̂|) Σ_{(x,y)∈D̂} ℓ(h(x, y)). We adopt the Wasserstein-1 distance (Arjovsky et al., 2017) as the metric to measure the similarity of the domains. Compared with other divergences, the Wasserstein distance has been theoretically proved tighter than the TV distance (Gong et al., 2016) or the Jensen-Shannon divergence (Combes et al., 2020). Based on previous work, label shift is generally handled by the label-distribution-ratio-weighted loss: R^α_S(h) = E_{(x,y)~S} α(y) ℓ(h(x, y)) with α(y) = T(y)/S(y). We also denote α̂_t as its empirical counterpart, estimated from samples. Besides, to measure the task relations, we define a simplex vector λ with λ[t] ≥ 0 and Σ_{t=1}^T λ[t] = 1 as the task relation coefficient vector, assigning high weight to
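A minimal sketch of the α-weighted risk can make these definitions concrete. Everything below is synthetic: the per-example losses stand in for ℓ(h(x, y)), and the target label marginal T(y) is taken as known for illustration (in the unsupervised scenarios of the paper, α must instead be estimated), so this is not the paper's estimator, only the weighting formula R^α_S(h) = E α(y) ℓ(h(x, y)) with α(y) = T(y)/S(y).

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy source sample: labels y in {0, 1, 2} and synthetic per-example losses
# standing in for l(h(x, y)), bounded by L_max = 1.
n = 1000
y = rng.integers(0, 3, size=n)
losses = rng.uniform(0.0, 1.0, size=n)

S_y = np.bincount(y, minlength=3) / n      # empirical source label marginal S(y)
T_y = np.array([0.6, 0.3, 0.1])            # assumed-known target label marginal T(y)

alpha = T_y / S_y                          # label distribution ratio alpha(y) = T(y)/S(y)
weighted_risk = np.mean(alpha[y] * losses) # empirical alpha-weighted risk R^alpha_S(h)
plain_risk = np.mean(losses)               # unweighted empirical risk for comparison

# Task relation coefficients lie on the simplex: lambda[t] >= 0, sum_t lambda[t] = 1.
lam = np.array([0.5, 0.3, 0.2])
assert np.all(lam >= 0.0) and np.isclose(lam.sum(), 1.0)
```

Note the sanity check implicit in the construction: since α(y)·S(y) = T(y), the weights satisfy Σ_y α(y) S(y) = 1, which distinguishes reweighting the label marginal from simply rescaling the loss.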

