

Abstract

Given multiple source datasets with labels, how can we train a target model with no labeled data? Multi-source domain adaptation (MSDA) aims to train a model using multiple source datasets different from a target dataset in the absence of target data labels. MSDA is a crucial problem applicable to many practical cases where labels for the target data are unavailable due to privacy issues. Existing MSDA frameworks are limited since they align data without considering the conditional distributions p(x|y) of each domain. They also do not fully utilize the unlabeled target data, and rely on limited feature extraction with a single extractor. In this paper, we propose MULTI-EPL, a novel method for multi-source domain adaptation. MULTI-EPL exploits label-wise moment matching to align the conditional distributions p(x|y), uses pseudolabels for the unavailable target labels, and introduces an ensemble of multiple feature extractors for accurate domain adaptation. Extensive experiments show that MULTI-EPL provides the state-of-the-art performance for multi-source domain adaptation tasks in both image and text domains.

1. INTRODUCTION

Given multiple source datasets with labels, how can we train a target model with no labeled data? Large amounts of training data are essential for training deep neural networks. Collecting abundant data is unfortunately an obstacle in practice; even if enough data are obtained, manually labeling them is prohibitively expensive. Using other available or much cheaper datasets would be a solution for these limitations; however, indiscriminate usage of other datasets often brings severe generalization error due to the presence of dataset shifts (Torralba & Efros (2011)). Unsupervised domain adaptation (UDA) tackles this problem, where no labeled data from the target domain are available but labeled data from other source domains are provided. Finding domain-invariant features has been the focus of UDA, since such features allow knowledge transfer from the labeled source dataset to the unlabeled target dataset.

There have been many efforts to transfer knowledge from a single source domain to a target one. Most recent frameworks minimize the distance between the two domains with deep neural networks and distance-based techniques such as discrepancy regularizers (Long et al. (2015; 2016; 2017)), adversarial networks (Ganin et al. (2016); Tzeng et al. (2017)), and generative networks (Liu et al. (2017); Zhu et al. (2017); Hoffman et al. (2018b)). While the above-mentioned approaches consider a single source, we address multi-source domain adaptation (MSDA), which is more practical in real-world applications as well as more challenging. MSDA can bring significant performance enhancement by virtue of its access to multiple datasets, as long as the multiple domain shift problems are resolved. Previous works have extensively presented both theoretical analyses (Ben-David et al. (2010); Mansour et al. (2008); Crammer et al. (2008); Hoffman et al. (2018a); Zhao et al. (2018); Zellinger et al. (2020)) and models (Zhao et al. (2018); Xu et al. (2018); Peng et al. (2019)) for MSDA. MDAN (Zhao et al. (2018)) and DCTN (Xu et al. (2018)) build adversarial networks for each source domain to generate features domain-invariant enough to confound domain classifiers. However, these approaches do not encompass the shifts among source domains, counting only the shifts between source and target domains. M³SDA (Peng et al. (2019)) adopts a moment matching strategy but makes the unrealistic assumption that matching the marginal probability p(x) guarantees the alignment of the conditional probability p(x|y). Most of these methods also do not fully exploit the knowledge of the target domain, owing to the inaccessibility of its labels. Furthermore, all these methods leverage a single feature extractor, which possibly misses important information for label classification.

In this paper, we propose MULTI-EPL (Multi-source domain adaptation with Ensemble of feature extractors, Pseudolabels, and Label-wise moment matching), a novel MSDA framework which mitigates the limitations of these methods: not explicitly considering the conditional probability p(x|y), and relying on only one feature extractor. The model architecture is illustrated in Figure 1. MULTI-EPL aligns the conditional probability p(x|y) by utilizing label-wise moment matching. We employ pseudolabels for the inaccessible target labels to maximize the usage of the target data. Moreover, generating an ensemble of features from multiple feature extractors gives abundant label information to the extracted features. Extensive experiments show the superiority of our method. Our contributions are summarized as follows:

• Method. We propose MULTI-EPL, a novel approach for MSDA that effectively obtains domain-invariant features from multiple domains by matching the conditional probability p(x|y), utilizing pseudolabels for inaccessible target labels to fully deploy the target data, and using an ensemble of multiple feature extractors. It allows domain-invariant features to be extracted while capturing the intrinsic differences of different labels.
• Analysis. We theoretically prove that minimizing the label-wise moment matching loss contributes to bounding the target error.
• Experiments. We conduct extensive experiments on image and text datasets. We show that 1) MULTI-EPL provides the state-of-the-art accuracy, and 2) each of our main ideas significantly contributes to the superior performance.

2. RELATED WORK

Single-source Domain Adaptation. Given a labeled source dataset and an unlabeled target dataset, single-source domain adaptation aims to train a model that performs well on the target domain. The challenge of single-source domain adaptation is to reduce the discrepancy between the two domains and to obtain appropriate domain-invariant features. Various discrepancy measures such as Maximum Mean Discrepancy (MMD) (Tzeng et al. (2014); Long et al. (2015; 2016; 2017); Ghifary et al. (2016)) and KL divergence (Zhuang et al. (2015)) have been used as regularizers. Inspired by the insight that domain-invariant features should exclude clues about their domain, constructing adversarial networks against domain classifiers has shown superior performance. Liu et al. (2017) and Hoffman et al. (2018b) deploy GANs to transform data across the source and target domains, while Ganin et al. (2016) and Tzeng et al. (2017) leverage adversarial networks to extract common features of the two domains. Unlike these works, we focus on multiple source domains.

Multi-source Domain Adaptation. Single-source domain adaptation should not be naively employed for multiple source domains due to the shifts between source domains. Many previous works have tackled MSDA problems theoretically. Mansour et al. (2008) establish the distribution weighted combining rule, by which a weighted combination of source hypotheses is a good approximation of the target hypothesis. The rule is further extended to a stochastic case with a joint distribution over the input and output spaces in Hoffman et al. (2018a). Crammer et al. (2008) propose a general theory of how to sift appropriate samples out of multi-source data using expected loss. Efforts to find transferable knowledge from multiple sources from a causal viewpoint are made in Zhang et al. (2015). There have been salient studies on the learning bounds for MSDA. Ben-David et al. (2010) derive generalization bounds based on the H∆H-divergence, which are further tightened by Zhao et al. (2018). Frameworks for MSDA have been presented as well. Zhao et al. (2018) propose learning algorithms based on the generalization bounds for MSDA. DCTN (Xu et al. (2018)) resolves domain and category shifts between source and target domains via adversarial networks. M³SDA (Peng et al. (2019)) associates all the domains into a common distribution by aligning the moments of the feature distributions of multiple domains. Lin et al. (2020) focus on visual sentiment classification tasks and attempt to find a common latent space of the source and target domains. Wang et al. (2020) consider the interactions among multiple domains and reflect this information by constructing a knowledge graph. However, none of these methods consider multimode structures (Pei et al. (2018)), in which differently labeled data follow distinct distributions even if they are drawn from the same domain. Also, the domain-invariant features in these methods contain the label information for only one label classifier, which leads these methods to miss a large amount of label information. Different from these methods, our framework fully accounts for the multimode structures by handling the data distributions in a label-wise manner, and minimizes the label information loss by considering multiple label classifiers.

Moment Matching. Domain adaptation has deployed the moment matching strategy to minimize the discrepancy between source and target domains. The MMD regularizer (Tzeng et al. (2014); Long et al. (2015; 2016; 2017); Ghifary et al. (2016)) can be interpreted as matching first-order moments, while Sun et al. (2016) address second-order moments of the source and target distributions. Zellinger et al. (2017) investigate the effect of higher-order moment matching. M³SDA (Peng et al. (2019)) demonstrates that moment matching yields remarkable performance also with multiple sources.
While previous works have focused on matching the moments of marginal distributions for single-source adaptation, we handle conditional distributions in multi-source scenarios.

3. PROPOSED METHOD

In this section, we describe our proposed method, MULTI-EPL. We first formulate the problem definition in Section 3.1 and describe our main ideas in Section 3.2. Section 3.3 elaborates how to match label-wise moments with pseudolabels, and Section 3.4 extends the approach with ensemble learning. Figure 1 shows the overview of MULTI-EPL.

3.1. PROBLEM DEFINITION

Given a set of labeled datasets from $N$ source domains $S_1, \ldots, S_N$ and an unlabeled dataset from a target domain $T$, we aim to construct a model that minimizes the test error on $T$. We formulate a source domain $S_i$ as a tuple of the data distribution $\mu_{S_i}$ on the data space $\mathcal{X}$ and the labeling function $l_{S_i}$: $S_i = (\mu_{S_i}, l_{S_i})$. The source dataset drawn from the distribution $\mu_{S_i}$ is denoted as $X_{S_i} = \{(x_j^{S_i}, y_j^{S_i})\}_{j=1}^{n_{S_i}}$. Likewise, the target domain and the target dataset are denoted as $T = (\mu_T, l_T)$ and $X_T = \{x_j^T\}_{j=1}^{n_T}$, respectively. We narrow our focus down to homogeneous settings in classification tasks: all domains share the same data space $\mathcal{X}$ and label set $C$.

3.2. OVERVIEW

We propose MULTI-EPL based on the following observations: 1) existing methods focus on aligning the marginal distributions p(x), not the conditional ones p(x|y); 2) knowledge of the target data is not fully employed since no target label is given; and 3) a large amount of label information is lost since domain-invariant features are extracted for only one label classifier. Thus, we design MULTI-EPL to resolve these limitations. Designing such a method entails the following challenges:

1. Matching conditional distributions. How can we align the conditional distributions p(x|y) of multiple domains, not the marginal ones p(x)?
2. Exploitation of the target data. How can we fully exploit the knowledge of the target data despite the absence of target labels?
3. Maximally utilizing feature information. How can we maximally utilize the information that the domain-invariant features contain?

We propose the following main ideas to address the challenges:

1. Label-wise moment matching (Section 3.3). We match the label-wise moments of the domain-invariant features so that features with the same labels have similar distributions regardless of their original domains.
2. Pseudolabels (Section 3.3). We use pseudolabels as alternatives to the target labels.
3. Ensemble of feature representations (Section 3.4). We learn to extract an ensemble of features from multiple feature extractors, each of which involves distinct domain-invariant features for its own label classifier.

3.3. LABEL-WISE MOMENT MATCHING WITH PSEUDOLABELS

We describe how MULTI-EPL matches the conditional distributions p(x|y) of the features from multiple distinct domains. In MULTI-EPL, a feature extractor $f_e$ and a label classifier $f_{lc}$ make the features domain-invariant and label-informative at the same time. The feature extractor $f_e$ extracts features from data, and the label classifier $f_{lc}$ receives the features and predicts the labels of the data. We train $f_e$ and $f_{lc}$ according to the losses for label-wise moment matching and label classification, which make the features domain-invariant and label-informative, respectively.

Label-wise Moment Matching. To achieve the alignment of domain-invariant features, we define a label-wise moment matching loss as follows:
$$L_{lmm,K} = \frac{1}{|C|}\binom{N+1}{2}^{-1} \sum_{k=1}^{K} \sum_{D,D'} \sum_{c \in C} \left\| \frac{1}{n_{D,c}} \sum_{j;\, y_j^{D}=c} f_e(x_j^{D})^k - \frac{1}{n_{D',c}} \sum_{j;\, y_j^{D'}=c} f_e(x_j^{D'})^k \right\|^2,$$
where $K$ is a hyperparameter indicating the maximum order of the moments considered by the loss, $D$ and $D'$ are two distinct domains among the $N+1$ domains, and $n_{D,c}$ is the number of data labeled as $c$ in $X_D$.

We introduce pseudolabels for the target data, determined by the outputs of the model currently being trained, to manage the absence of ground truths for the target data. In other words, we leverage $f_{lc}(f_e(x^T))$ to assign a pseudolabel to the target data $x^T$. Drawing the pseudolabels from the incomplete model, however, brings a mislabeling issue which impedes further training. To alleviate this problem, we set a threshold $\tau$ and assign pseudolabels to the target data only when the prediction confidence is greater than the threshold. The target examples with low confidence are not pseudolabeled and not counted in label-wise moment matching. By minimizing $L_{lmm,K}$, the feature extractor $f_e$ aligns data from multiple domains by enforcing consistency in the distributions of the features with the same labels. The data with distinct labels are aligned independently, taking account of the multimode structures in which differently labeled data follow different distributions.

Label Classification. The label classifier $f_{lc}$ takes the features projected by $f_e$ as inputs and makes the label predictions. The label classification loss is defined as follows:
$$L_{lc} = \frac{1}{N} \sum_{i=1}^{N} \frac{1}{n_{S_i}} \sum_{j=1}^{n_{S_i}} L_{ce}(f_{lc}(f_e(x_j^{S_i})), y_j^{S_i}),$$
where $L_{ce}$ is the softmax cross-entropy loss. Minimizing $L_{lc}$ separates the features with different labels so that they become label-distinguishable.
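To make the two losses of this section concrete, the following minimal Python sketch computes the label-wise moment matching loss on toy features and applies the confidence threshold τ for pseudolabeling. All function and variable names are our own illustrative assumptions, not the authors' implementation; features are plain Python lists instead of tensors, and domains are dicts mapping a label to its feature vectors.

```python
from itertools import combinations

def moment(feats, k):
    """Element-wise k-th order moment: mean of f_e(x)^k over examples."""
    dim = len(feats[0])
    return [sum(f[i] ** k for f in feats) / len(feats) for i in range(dim)]

def label_wise_moment_loss(domains, K=2):
    """Label-wise moment matching loss L_lmm,K (sketch).
    domains: list of dicts, one per domain; each maps label c to the list
    of feature vectors f_e(x) of the examples labeled c."""
    labels = set().union(*(d.keys() for d in domains))
    n_pairs = len(domains) * (len(domains) - 1) // 2  # C(N+1, 2) domain pairs
    total = 0.0
    for k in range(1, K + 1):          # moment orders 1..K
        for d1, d2 in combinations(domains, 2):
            for c in labels:
                if not d1.get(c) or not d2.get(c):
                    continue           # a label absent in a domain adds nothing
                m1, m2 = moment(d1[c], k), moment(d2[c], k)
                # squared L2 distance between the label-wise moments
                total += sum((a - b) ** 2 for a, b in zip(m1, m2))
    return total / (len(labels) * n_pairs)

def confident_pseudolabels(probs, tau=0.9):
    """Assign a pseudolabel only when the top class probability exceeds tau;
    low-confidence examples (None) are excluded from L_lmm,K."""
    out = []
    for p in probs:
        c = max(range(len(p)), key=p.__getitem__)
        out.append(c if p[c] >= tau else None)
    return out
```

If the per-label moments of two domains coincide up to order K, the loss is zero, which is the alignment the paper aims for.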

3.4. ENSEMBLE OF FEATURE REPRESENTATIONS

In this section, we introduce ensemble learning for further enhancement. The features extracted with the strategies elaborated in the previous section contain the label information for a single label classifier. However, each label classifier leverages only limited label characteristics, and thus the conventional scheme of adopting only one pair of feature extractor and label classifier captures only a small part of the label information. Our idea is to leverage an ensemble of multiple pairs of feature extractor and label classifier in order to make the features more label-informative.

We train multiple pairs of feature extractor and label classifier in parallel, following the label-wise moment matching approach explained in Section 3.3. Let $n$ denote the number of feature extractors in the overall model. We denote the $n$ (feature extractor, label classifier) pairs as $(f_{e,1}, f_{lc,1}), (f_{e,2}, f_{lc,2}), \ldots, (f_{e,n}, f_{lc,n})$ and the $n$ resulting features as $feat_1, feat_2, \ldots, feat_n$, where $feat_i$ is the output of the feature extractor $f_{e,i}$. After obtaining the $n$ different feature mapping modules, we concatenate the $n$ features into one vector $feat_{final} = concat(feat_1, feat_2, \ldots, feat_n)$. The final label classifier $f_{lc,final}$ takes the concatenated feature as input and predicts its label.

Naively exploiting multiple feature extractors, however, does not guarantee the diversity of the features since it relies only on randomness. Thus, we introduce a new model component, the extractor classifier, which separates the features from different extractors. The extractor classifier $f_{ec}$ takes the features generated by a feature extractor as inputs and predicts which feature extractor has generated them. For example, if $n = 2$, the extractor classifier $f_{ec}$ attempts to predict whether the input feature is extracted by the extractor $f_{e,1}$ or $f_{e,2}$.
By training the extractor classifier and the multiple feature extractors at once, we explicitly diversify the features obtained from different extractors. We train the extractor classifier utilizing the feature diversifying loss $L_{fd}$:
$$L_{fd} = \frac{1}{N+1} \left[ \sum_{i=1}^{N} \frac{1}{n_{S_i}} \sum_{j=1}^{n_{S_i}} \sum_{k=1}^{n} L_{ce}(f_{ec}(f_{e,k}(x_j^{S_i})), k) + \frac{1}{n_T} \sum_{j=1}^{n_T} \sum_{k=1}^{n} L_{ce}(f_{ec}(f_{e,k}(x_j^{T})), k) \right],$$
where $n$ is the number of feature extractors.
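As a minimal illustration of the feature diversifying loss, the sketch below computes the cross-entropy of a hypothetical extractor classifier against the index of the extractor that produced each feature. Passing the classifier in as a plain function returning probabilities is an assumption of this sketch, not the paper's architecture.

```python
import math

def feature_diversifying_loss(features_per_extractor, extractor_classifier):
    """Average cross-entropy between the extractor classifier's prediction
    and the index k of the extractor that actually produced each feature.
    features_per_extractor[k]: list of feature vectors from extractor k.
    extractor_classifier(feature): list of probabilities over extractor ids."""
    total, count = 0.0, 0
    for k, feats in enumerate(features_per_extractor):
        for f in feats:
            probs = extractor_classifier(f)
            total += -math.log(max(probs[k], 1e-12))  # CE with target id k
            count += 1
    return total / count
```

Minimizing this loss jointly over the extractor classifier and the feature extractors pushes differently indexed extractors toward separable feature regions, which is the explicit diversification described above.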

3.5. MULTI-EPL: ACCURATE MULTI-SOURCE DOMAIN ADAPTATION

Our final model MULTI-EPL consists of $n$ pairs of feature extractor and label classifier, $(f_{e,1}, f_{lc,1}), (f_{e,2}, f_{lc,2}), \ldots, (f_{e,n}, f_{lc,n})$, one extractor classifier $f_{ec}$, and one final label classifier $f_{lc,final}$. We first train the entire model except the final label classifier with the loss $L$:
$$L = \sum_{k=1}^{n} L_{lc,k} + \alpha \sum_{k=1}^{n} L_{lmm,K,k} + \beta L_{fd},$$
where $L_{lc,k}$ is the label classification loss of the classifier $f_{lc,k}$, $L_{lmm,K,k}$ is the label-wise moment matching loss of the feature extractor $f_{e,k}$, and $\alpha$ and $\beta$ are hyperparameters. Then, the final label classifier is trained with respect to the label classification loss $L_{lc,final}$ using the concatenated features from the multiple feature extractors.
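The first-stage objective above is a weighted sum of the per-extractor losses; the tiny sketch below shows how the pieces combine, with hypothetical placeholder loss values (the function name is ours, not the authors'):

```python
def multi_epl_objective(lc_losses, lmm_losses, fd_loss, alpha, beta):
    """Stage-1 objective of MULTI-EPL (sketch):
    L = sum_k L_lc,k + alpha * sum_k L_lmm,K,k + beta * L_fd,
    combining the per-extractor classification and label-wise moment
    matching losses with the feature diversifying loss."""
    return sum(lc_losses) + alpha * sum(lmm_losses) + beta * fd_loss
```

For example, with n = 2 extractors, classification losses [1.0, 2.0], moment losses [10.0, 10.0], fd_loss 0.5, alpha = 0.1, and beta = 2.0, the objective combines to 3.0 + 2.0 + 1.0.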

4. ANALYSIS

We present a theoretical insight regarding the validity of the label-wise moment matching loss. For simplicity, we tackle only binary classification tasks. The error rate of a hypothesis $h$ on a domain $D$ is denoted as $\epsilon_D(h) = \Pr[h(x) \neq l_D(x)]$, where $l_D$ is the labeling function on the domain $D$. We first introduce the $k$-th order label-wise moment divergence.

Definition 1. Let $D$ and $D'$ be two domains over an input space $\mathcal{X} \subset \mathbb{R}^n$, where $n$ is the dimension of the inputs. Let $C$ be the set of labels, and $\mu_c(x)$ and $\mu'_c(x)$ be the data distributions given that the label is $c$, i.e., $\mu_c(x) = \mu(x \mid y = c)$ and $\mu'_c(x) = \mu'(x \mid y = c)$ for the data distributions $\mu$ and $\mu'$ on the domains $D$ and $D'$, respectively. Then, the $k$-th order label-wise moment divergence $d_{LM,k}(D, D')$ of the two domains $D$ and $D'$ over $\mathcal{X}$ is defined as
$$d_{LM,k}(D, D') = \sum_{c \in C} \sum_{i \in \Delta_k} \left| p(c) \int_{\mathcal{X}} \mu_c(x) \prod_{j=1}^{n} (x_j)^{i_j} \, dx - p'(c) \int_{\mathcal{X}} \mu'_c(x) \prod_{j=1}^{n} (x_j)^{i_j} \, dx \right|,$$
where $\Delta_k = \{i = (i_1, \ldots, i_n) \in \mathbb{N}_0^n \mid \sum_{j=1}^{n} i_j = k\}$ is the set of tuples of nonnegative integers that add up to $k$, $p(c)$ and $p'(c)$ are the probabilities that arbitrary data from $D$ and $D'$ are labeled as $c$, respectively, and the data $x \in \mathcal{X}$ are expressed as $(x_1, \ldots, x_n)$.

The ultimate goal of MSDA is to find a hypothesis $h$ with the minimum target error. We nevertheless train the model with respect to the source data, since ground truths for the target are unavailable. Let $N$ datasets be drawn from $N$ labeled source domains $S_1, \ldots, S_N$, respectively. We denote the $i$-th source dataset $X_{S_i}$ as $\{(x_j^{S_i}, y_j^{S_i})\}_{j=1}^{n_{S_i}}$. The empirical error of a hypothesis $h$ on the $i$-th source domain $S_i$ estimated with $X_{S_i}$ is formulated as $\hat{\epsilon}_{S_i}(h) = \frac{1}{n_{S_i}} \sum_{j=1}^{n_{S_i}} \mathbb{1}_{h(x_j^{S_i}) \neq y_j^{S_i}}$. Given a weight vector $\alpha = (\alpha_1, \alpha_2, \ldots, \alpha_N)$ such that $\sum_{i=1}^{N} \alpha_i = 1$, the weighted empirical source error is formulated as $\hat{\epsilon}_\alpha(h) = \sum_{i=1}^{N} \alpha_i \hat{\epsilon}_{S_i}(h)$. We extend the theorems in Ben-David et al. (2010); Peng et al. (2019) and derive a bound for the target error $\epsilon_T(h)$, for $h$ trained with source data, in terms of the $k$-th order label-wise moment divergence.

Theorem 1. Let $H$ be a hypothesis space of VC dimension $d$, $n_{S_i}$ be the number of samples from source domain $S_i$, $m = \sum_{i=1}^{N} n_{S_i}$ be the total number of samples from the $N$ source domains $S_1, \ldots, S_N$, and $\beta = (\beta_1, \ldots, \beta_N)$ with $\beta_i = \frac{n_{S_i}}{m}$. Let us define a hypothesis $\hat{h} = \arg\min_{h \in H} \hat{\epsilon}_\alpha(h)$ that minimizes the weighted empirical source error, and a hypothesis $h_T^* = \arg\min_{h \in H} \epsilon_T(h)$ that minimizes the true target error. Then, for any $\delta \in (0, 1)$ and $\epsilon > 0$, there exist $N$ integers $n_\epsilon^1, \ldots, n_\epsilon^N$ and $N$ constants $a_{n_\epsilon^1}, \ldots, a_{n_\epsilon^N}$ such that
$$\epsilon_T(\hat{h}) \leq \epsilon_T(h_T^*) + \eta_{\alpha,\beta,m,\delta} + \epsilon + \sum_{i=1}^{N} \alpha_i \left( 2\lambda_i + a_{n_\epsilon^i} \sum_{k=1}^{n_\epsilon^i} d_{LM,k}(S_i, T) \right)$$
with probability at least $1 - \delta$, where
$$\eta_{\alpha,\beta,m,\delta} = 4 \sqrt{\left( \sum_{i=1}^{N} \frac{\alpha_i^2}{\beta_i} \right) \left( \frac{2d\left(\log\left(\frac{2m}{d}\right) + 1\right) + 2\log\left(\frac{4}{\delta}\right)}{m} \right)}, \qquad \lambda_i = \min_{h \in H} \{\epsilon_T(h) + \epsilon_{S_i}(h)\}.$$

Proof. See Appendix A.1.

Assuming that all datasets are balanced over the annotations, i.e., $p(c) = p'(c) = \frac{1}{|C|}$ for any $c \in C$, $L_{lmm,K}$ is expressed as the sum of the estimates of $d_{LM,k}$ with $k = 1, \ldots, K$. The theorem provides the insight that label-wise moment matching allows the model trained with source data to achieve performance comparable to the optimal one on the target domain.
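The correspondence between the empirical loss and the divergence can be made explicit with the law of large numbers. The sketch below is our own illustration and works in the feature space rather than the raw input space, which is an assumption of this illustration:

```latex
% Each empirical label-wise moment converges to its population counterpart:
\frac{1}{n_{D,c}} \sum_{j:\, y_j^D = c} f_e(x_j^D)^k
  \;\xrightarrow[\; n_{D,c} \to \infty \;]{}\;
  \mathbb{E}_{x \sim \mu_c^D}\!\left[ f_e(x)^k \right].
% Under the balanced-label assumption p(c) = p'(c) = 1/|C|, the difference of
% the two empirical moments in each summand of L_{lmm,K} therefore estimates,
% up to the 1/|C| weighting and the use of a squared l2 norm in place of an
% absolute value, the corresponding term of d_{LM,k}(D, D') evaluated on the
% extracted features f_e(x).
```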

5. EXPERIMENTS

We conduct experiments to evaluate MULTI-EPL.

Model Architecture. For Office-Caltech10, ResNet (He et al. (2016)) pretrained on ImageNet is used as the backbone architecture. For Amazon Reviews, the feature extractor is composed of three fully-connected layers with 1000, 500, and 100 output units, respectively, and a single fully-connected layer with 100 input units and 2 output units is adopted for both the extractor and label classifiers. For Digits-Five, LeNet5 is adopted.

Training Details. We train our models for Digits-Five with the Adam optimizer (Kingma & Ba (2015)) with β1 = 0.9, β2 = 0.999, and a learning rate of 0.0004 for 100 epochs. All images are scaled to 32 × 32 and the mini-batch size is set to 128. We set the hyperparameters α = 0.0005, β = 1, and K = 2. For the experiments with Office-Caltech10, all the modules comprising our model are trained with SGD with a learning rate of 0.001, except that the optimizers for the feature extractors use a learning rate of 0.0001. We scale all images to 224 × 224 and set the mini-batch size to 48. All other hyperparameters are kept the same as in the experiments with Digits-Five. For Amazon Reviews, we train the models for 50 epochs using the Adam optimizer with β1 = 0.9, β2 = 0.999, and a learning rate of 0.0001. We set α = β = 1, K = 2, and the mini-batch size to 100. For every experiment, the confidence threshold τ is set to 0.9.
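For reference, the training details above can be collected into per-dataset configuration dictionaries; the key names are our own illustrative choices, not taken from the authors' code:

```python
# Hypothetical configuration dicts mirroring the training details above.
CONFIGS = {
    "digits-five": dict(optimizer="adam", lr=4e-4, betas=(0.9, 0.999),
                        epochs=100, image_size=32, batch_size=128,
                        alpha=5e-4, beta=1.0, K=2, tau=0.9),
    "office-caltech10": dict(optimizer="sgd", lr=1e-3, extractor_lr=1e-4,
                             image_size=224, batch_size=48,
                             alpha=5e-4, beta=1.0, K=2, tau=0.9),
    "amazon-reviews": dict(optimizer="adam", lr=1e-4, betas=(0.9, 0.999),
                           epochs=50, batch_size=100,
                           alpha=1.0, beta=1.0, K=2, tau=0.9),
}
```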

5.2. PERFORMANCE EVALUATION

We evaluate the performance of MULTI-EPL with n = 2 against the competitors. We repeat the experiments for each setting five times and report the mean and the standard deviation. The results are summarized in Table 2. Note that MULTI-EPL provides the best accuracy on all the datasets, showing its consistent superiority on both the image datasets (Digits-Five, Office-Caltech10) and the text dataset (Amazon Reviews). The enhancement is especially remarkable when MNIST-M is the target domain in Digits-Five, improving the accuracy by 11.48% compared to the state-of-the-art methods.

5.3. ABLATION STUDY

We perform an ablation study on Digits-Five to identify what exactly enhances the performance of MULTI-EPL. We compare MULTI-EPL with three of its variants: MULTI-0, MULTI-PL, and MULTI-EPL-R. MULTI-0 aligns moments regardless of the labels of the data. MULTI-PL trains the model without ensemble learning. MULTI-EPL-R exploits the ensemble learning strategy but relies on randomness, without the extractor classifier and the feature diversifying loss. The results are shown in Table 3. By comparing MULTI-0 with MULTI-PL, we observe that considering labels in moment matching plays a significant role in extracting domain-invariant features. The remarkable performance gap between MULTI-PL and MULTI-EPL with n = 2 verifies the effectiveness of ensemble learning. Comparing MULTI-EPL and MULTI-EPL-R, MULTI-EPL shows better performance than MULTI-EPL-R in half of the cases; this means that the explicit feature diversifying loss often helps further improve the accuracy, while resorting to randomness for feature diversification also works in general. Hence, we conclude that the ensemble learning approach can be applied without concern about redundancy in the features.

5.4. EFFECTS OF ENSEMBLE

We vary n, the number of pairs of feature extractor and label classifier, and repeat the performance evaluation on Digits-Five. The results are summarized in Table 3. While an ensemble of two pairs gives much better performance than the model with a single pair, using more than two pairs rarely brings further improvement. This result demonstrates that two pairs of feature extractor and label classifier are able to cover most of the important label information in Digits-Five. It is notable that increasing n sometimes brings a small performance degradation. As more feature extractors are adopted to obtain the final features, the complexity of the final features increases, and it is harder for the final label classifier to manage such complex features than simple ones. This deteriorates the performance when we exploit more than two feature extractors.

6. CONCLUSION

We propose MULTI-EPL, a novel framework for the multi-source domain adaptation problem. MULTI-EPL overcomes the problems of existing methods: not directly addressing the conditional distributions p(x|y) of data, not fully exploiting the knowledge of the target data, and missing a large amount of label information. MULTI-EPL aligns data from multiple source domains and the target domain considering the data labels, and uses pseudolabels to exploit the unlabeled target data. MULTI-EPL further enhances the performance by generating an ensemble of multiple feature extractors. Our framework exhibits superior performance on both image and text classification tasks. Our ablation study shows that considering labels in moment matching and adding the ensemble learning idea bring remarkable performance enhancements. Future work includes extending our approach to other tasks such as regression, which may require modification of the pseudolabeling method.

A APPENDIX

A.1 PROOF FOR THEOREM 1 In this section, we prove Theorem 1 in the paper. We first define k-th order label-wise moment divergence d LM,k , and disagreement ratio $ D (h 1 , h 2 ) of the two hypotheses h 1 , h 2 ∈ H on the domain D. Definition 1. Let D and D ′ be two domains over an input space X ⊂ R n where n is the dimension of the inputs. Let C be the set of the labels, and µ c (x) and µ ′ c (x) be the data distributions given that the label is c, i.e. µ c (x) = µ(x|y = c) and µ ′ c (x) = µ ′ (x|y = c) for the data distribution µ and µ ′ on the domains D and D ′ , respectively. Then, the k-th order label-wise moment divergence d LM,k (D, D ′ ) of the two domains D and D ′ over X is defined as d LM,k (D, D ′ ) = # c∈C # i∈∆ k ) ) ) ) ) ) p(c) * X µ c (x) n + j=1 (x j ) ij dx -p ′ (c) * X µ ′ c (x) n + j=1 (x j ) ij dx ) ) ) ) ) ) , where ∆ k = {i = (i 1 , . . . , i n ) ∈ N n 0 | , n j=1 i j = k} is the set of the tuples of the nonnegative integers, which add up to k, p(c) and p ′ (c) are the probability that arbitrary data from D and D ′ to be labeled as c respectively, and the data x ∈ X is expressed as (x 1 , . . . , x n ). Definition 2. Let D be a domain over an input space X ⊂ R n with the data distribution µ(x). Then, we define the disagreement ratio $ D (h 1 , h 2 ) of the two hypotheses h 1 , h 2 ∈ H on the domain D as $ D (h 1 , h 2 ) = Pr x∼µ(x) [h 1 (x) ∕ = h 2 (x)]. Theorem 2. (Stone-Weierstrass Theorem (Stone (1937))) Let K be a compact subset of R n and f : K → R be a continuous function. Then, for every $ > 0, there exists a polynomial, P : K → R, such that sup x∈K |f (x) -P (x)| < $. Theorem 2 indicates that continuous functions on a compact subset of R n are approximated with polynomials. We next formulate the discrepancy of the two domains using the disagreement ratio and bound it with the label-wise moment divergence. Lemma 1. Let D and D ′ be two domains over an input space X ∈ R n , where n is the dimension of the inputs. 
Then, for any hypotheses $h_1, h_2 \in \mathcal{H}$ and any $\epsilon > 0$, there exist $n_\epsilon \in \mathbb{N}$ and a constant $a_{n_\epsilon}$ such that

$$|\epsilon_D(h_1, h_2) - \epsilon_{D'}(h_1, h_2)| \le \frac{1}{2} a_{n_\epsilon} \sum_{k=1}^{n_\epsilon} d_{LM,k}(D, D') + \epsilon.$$

Proof. Let the domains $D$ and $D'$ have the data distributions $\mu(x)$ and $\mu'(x)$, respectively, over an input space $\mathcal{X}$, which is a compact subset of $\mathbb{R}^n$, where $n$ is the dimension of the inputs. For brevity, we denote $|\epsilon_D(h_1, h_2) - \epsilon_{D'}(h_1, h_2)|$ as $\Delta_{D,D'}$. Then,

$$\begin{aligned}
\Delta_{D,D'} &= |\epsilon_D(h_1, h_2) - \epsilon_{D'}(h_1, h_2)| \le \sup_{h_1,h_2 \in \mathcal{H}} |\epsilon_D(h_1, h_2) - \epsilon_{D'}(h_1, h_2)| \\
&= \sup_{h_1,h_2 \in \mathcal{H}} \Big| \Pr_{x \sim \mu(x)}[h_1(x) \neq h_2(x)] - \Pr_{x \sim \mu'(x)}[h_1(x) \neq h_2(x)] \Big| \\
&= \sup_{h_1,h_2 \in \mathcal{H}} \Big| \int_{\mathcal{X}} \mu(x) \mathbb{1}_{h_1(x) \neq h_2(x)} \, dx - \int_{\mathcal{X}} \mu'(x) \mathbb{1}_{h_1(x) \neq h_2(x)} \, dx \Big|. \quad (11)
\end{aligned}$$

For any hypotheses $h_1, h_2$, the indicator function $\mathbb{1}_{h_1(x) \neq h_2(x)}$ is Lebesgue integrable on $\mathcal{X}$, i.e., $\mathbb{1}_{h_1(x) \neq h_2(x)}$ is an $L^1$ function. Since the set of continuous functions is dense in $L^1(\mathcal{X})$, for every $\epsilon > 0$ there exists a continuous $L^1$ function $f$ defined on $\mathcal{X}$ such that

$$\big| \mathbb{1}_{h_1(x) \neq h_2(x)} - f(x) \big| \le \frac{\epsilon}{4} \quad (12)$$

for every $x \in \mathcal{X}$ and the fixed $h_1$ and $h_2$ that drive equation 11 to the supremum. Accordingly,

$$f(x) - \frac{\epsilon}{4} \le \mathbb{1}_{h_1(x) \neq h_2(x)} \le f(x) + \frac{\epsilon}{4}.$$

By integrating every term in the inequality against $\mu$ over $\mathcal{X}$, the inequality

$$\int_{\mathcal{X}} \mu(x) f(x) \, dx - \frac{\epsilon}{4} \le \int_{\mathcal{X}} \mu(x) \mathbb{1}_{h_1(x) \neq h_2(x)} \, dx \le \int_{\mathcal{X}} \mu(x) f(x) \, dx + \frac{\epsilon}{4}$$

follows. Likewise, the same inequality holds on the domain $D'$ with $\mu'$ instead of $\mu$. By subtracting the two inequalities and reformulating them, we obtain

$$\Delta_{D,D'} \le \Big| \int_{\mathcal{X}} \mu(x) f(x) \, dx - \int_{\mathcal{X}} \mu'(x) f(x) \, dx \Big| + \frac{\epsilon}{2}. \quad (16)$$

By Theorem 2, there exists a polynomial $P(x)$ such that

$$\sup_{x \in \mathcal{X}} |f(x) - P(x)| < \frac{\epsilon}{4}, \quad (17)$$

and the polynomial $P(x)$ is expressed as

$$P(x) = \sum_{k=1}^{n_\epsilon} \sum_{i \in \Delta_k} \alpha_i \prod_{j=1}^{n} (x_j)^{i_j}, \quad (18)$$

where $n_\epsilon$ is the order of the polynomial, $\Delta_k = \{ i = (i_1, \ldots, i_n) \in \mathbb{N}_0^n \mid \sum_{j=1}^{n} i_j = k \}$ is the set of the tuples of nonnegative integers which add up to $k$, $\alpha_i$ is the coefficient of each term of the polynomial, and $x = (x_1, x_2, \ldots, x_n)$.

By applying equation 17 to equation 16 and substituting the expression in equation 18,

$$\begin{aligned}
\Delta_{D,D'} &\le \Big| \int_{\mathcal{X}} \mu(x) P(x) \, dx - \int_{\mathcal{X}} \mu'(x) P(x) \, dx \Big| + \epsilon \\
&= \Big| \int_{\mathcal{X}} \mu(x) \sum_{k=1}^{n_\epsilon} \sum_{i \in \Delta_k} \alpha_i \prod_{j=1}^{n} (x_j)^{i_j} \, dx - \int_{\mathcal{X}} \mu'(x) \sum_{k=1}^{n_\epsilon} \sum_{i \in \Delta_k} \alpha_i \prod_{j=1}^{n} (x_j)^{i_j} \, dx \Big| + \epsilon \\
&\le \sum_{k=1}^{n_\epsilon} \Big| \sum_{i \in \Delta_k} \alpha_i \int_{\mathcal{X}} \mu(x) \prod_{j=1}^{n} (x_j)^{i_j} \, dx - \alpha_i \int_{\mathcal{X}} \mu'(x) \prod_{j=1}^{n} (x_j)^{i_j} \, dx \Big| + \epsilon \\
&\le \sum_{k=1}^{n_\epsilon} \sum_{i \in \Delta_k} |\alpha_i| \Big| \int_{\mathcal{X}} \mu(x) \prod_{j=1}^{n} (x_j)^{i_j} \, dx - \int_{\mathcal{X}} \mu'(x) \prod_{j=1}^{n} (x_j)^{i_j} \, dx \Big| + \epsilon \\
&= \sum_{k=1}^{n_\epsilon} \sum_{i \in \Delta_k} |\alpha_i| \Big| \int_{\mathcal{X}} \sum_{c \in C} p(c) \mu_c(x) \prod_{j=1}^{n} (x_j)^{i_j} \, dx - \int_{\mathcal{X}} \sum_{c \in C} p'(c) \mu'_c(x) \prod_{j=1}^{n} (x_j)^{i_j} \, dx \Big| + \epsilon. \quad (19)
\end{aligned}$$

Letting $a_{\Delta_k} = \max_{i \in \Delta_k} |\alpha_i|$,

$$\begin{aligned}
\Delta_{D,D'} &\le \sum_{k=1}^{n_\epsilon} a_{\Delta_k} \sum_{i \in \Delta_k} \Big| \int_{\mathcal{X}} \sum_{c \in C} p(c) \mu_c(x) \prod_{j=1}^{n} (x_j)^{i_j} \, dx - \int_{\mathcal{X}} \sum_{c \in C} p'(c) \mu'_c(x) \prod_{j=1}^{n} (x_j)^{i_j} \, dx \Big| + \epsilon \\
&\le \sum_{k=1}^{n_\epsilon} a_{\Delta_k} \sum_{i \in \Delta_k} \sum_{c \in C} \Big| p(c) \int_{\mathcal{X}} \mu_c(x) \prod_{j=1}^{n} (x_j)^{i_j} \, dx - p'(c) \int_{\mathcal{X}} \mu'_c(x) \prod_{j=1}^{n} (x_j)^{i_j} \, dx \Big| + \epsilon \\
&\le \sum_{k=1}^{n_\epsilon} a_{\Delta_k} d_{LM,k}(D, D') + \epsilon \le \frac{1}{2} a_{n_\epsilon} \sum_{k=1}^{n_\epsilon} d_{LM,k}(D, D') + \epsilon, \quad (20)
\end{aligned}$$

for $a_{n_\epsilon} = 2 \max_{1 \le k \le n_\epsilon} a_{\Delta_k}$.

Let $N$ datasets be drawn from $N$ labeled source domains $S_1, S_2, \ldots, S_N$, respectively. We denote the $i$-th source dataset $\{(x_j^{S_i}, y_j^{S_i})\}_{j=1}^{n_{S_i}}$ as $X^{S_i}$. The empirical error of a hypothesis $h$ in the $i$-th source domain $S_i$ estimated with $X^{S_i}$ is formulated as $\hat{\epsilon}_{S_i}(h) = \frac{1}{n_{S_i}} \sum_{j=1}^{n_{S_i}} \mathbb{1}_{h(x_j^{S_i}) \neq y_j^{S_i}}$. Given a positive weight vector $\alpha = (\alpha_1, \alpha_2, \ldots, \alpha_N)$ such that $\sum_{i=1}^{N} \alpha_i = 1$ and $\alpha_i \ge 0$, the weighted empirical source error is formulated as $\hat{\epsilon}_\alpha(h) = \sum_{i=1}^{N} \alpha_i \hat{\epsilon}_{S_i}(h)$.

Lemma 2. For $N$ source domains $S_1, S_2, \ldots, S_N$, let $n_{S_i}$ be the number of samples from source domain $S_i$, $m = \sum_{i=1}^{N} n_{S_i}$ be the total number of samples from the $N$ source domains, and $\beta = (\beta_1, \beta_2, \ldots, \beta_N)$ with $\beta_i = \frac{n_{S_i}}{m}$. Let $\epsilon_\alpha(h)$ be the weighted true source error, i.e., the weighted sum of $\epsilon_{S_i}(h) = \Pr_{x \sim \mu_i(x)}[h(x) \neq y]$. Then,

$$\Pr\big[ |\hat{\epsilon}_\alpha(h) - \epsilon_\alpha(h)| \ge \epsilon \big] \le 2 \exp\left( \frac{-2m\epsilon^2}{\sum_{i=1}^{N} \alpha_i^2 / \beta_i} \right).$$

Proof. It has been proven in Ben-David et al. (2010).

We now turn our focus back to Theorem 1 in the paper and complete the proof.

Theorem 1. Let $\mathcal{H}$ be a hypothesis space of VC dimension $d$, $n_{S_i}$ be the number of samples from source domain $S_i$, $m = \sum_{i=1}^{N} n_{S_i}$ be the total number of samples from the $N$ source domains $S_1, \ldots, S_N$, and $\beta = (\beta_1, \ldots, \beta_N)$ with $\beta_i = \frac{n_{S_i}}{m}$. Let us define a hypothesis $\hat{h} = \arg\min_{h \in \mathcal{H}} \hat{\epsilon}_\alpha(h)$ that minimizes the weighted empirical source error, and a hypothesis $h_T^* = \arg\min_{h \in \mathcal{H}} \epsilon_T(h)$ that minimizes the true target error. Then, for any $\delta \in (0, 1)$ and $\epsilon > 0$, there exist $N$ integers $n_\epsilon^1, \ldots, n_\epsilon^N$ and $N$ constants $a_{n_\epsilon^1}, \ldots, a_{n_\epsilon^N}$ such that

$$\epsilon_T(\hat{h}) \le \epsilon_T(h_T^*) + \eta_{\alpha,\beta,m,\delta} + \epsilon + \sum_{i=1}^{N} \alpha_i \left( 2\lambda_i + a_{n_\epsilon^i} \sum_{k=1}^{n_\epsilon^i} d_{LM,k}(S_i, T) \right).$$

Proof. First,

$$|\epsilon_\alpha(h) - \epsilon_T(h)| = \Big| \sum_{i=1}^{N} \alpha_i \epsilon_{S_i}(h) - \epsilon_T(h) \Big| \le \sum_{i=1}^{N} \alpha_i |\epsilon_{S_i}(h) - \epsilon_T(h)|. \quad (23)$$

We define $h_i^* = \arg\min_{h \in \mathcal{H}} \{ \epsilon_{S_i}(h) + \epsilon_T(h) \}$ for every $i = 1, 2, \ldots, N$ for the following equations. We also note that the 1-triangle inequality (Crammer et al. (2008)) holds for binary classification tasks, i.e., $\epsilon_D(h_1, h_2) \le \epsilon_D(h_1, h_3) + \epsilon_D(h_2, h_3)$ for any hypotheses $h_1, h_2, h_3 \in \mathcal{H}$ and domain $D$. Then,

$$|\epsilon_D(h) - \epsilon_D(h, h')| = |\epsilon_D(h, l_D) - \epsilon_D(h, h')| \le \epsilon_D(l_D, h') = \epsilon_D(h') \quad (24)$$

for the ground-truth labeling function $l_D$ on the domain $D$ and two hypotheses $h, h' \in \mathcal{H}$.
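Lemma 2's concentration bound is easy to evaluate numerically. Below is a minimal sketch (plain Python; the sample sizes and weight vectors are illustrative choices of ours, not values from the paper). It also illustrates that the bound is tightest when the weights $\alpha$ match the sample proportions $\beta$, since $\sum_i \alpha_i^2/\beta_i \ge (\sum_i \alpha_i)^2 = 1$ by Cauchy–Schwarz, with equality at $\alpha = \beta$:

```python
import math

def lemma2_bound(alpha, beta, m, eps):
    """Upper bound from Lemma 2 on Pr[|empirical - true weighted source error| >= eps]:
    2 * exp(-2 * m * eps^2 / sum_i alpha_i^2 / beta_i)."""
    assert abs(sum(alpha) - 1.0) < 1e-9 and abs(sum(beta) - 1.0) < 1e-9
    denom = sum(a * a / b for a, b in zip(alpha, beta))
    return 2.0 * math.exp(-2.0 * m * eps ** 2 / denom)

# Illustrative numbers: 3 sources, 6000 samples in total, eps = 0.05.
beta = [0.5, 0.3, 0.2]                                   # sample proportions n_Si / m
matched = lemma2_bound(beta, beta, 6000, 0.05)           # alpha = beta
uniform = lemma2_bound([1/3, 1/3, 1/3], beta, 6000, 0.05)
skewed = lemma2_bound([0.8, 0.1, 0.1], beta, 6000, 0.05)

print(matched <= uniform <= skewed)  # True: alpha = beta gives the tightest bound
```

The more the weights deviate from the sample proportions, the looser the guarantee becomes for the same number of samples.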
Applying the definition and the inequality to equation 23,

$$\begin{aligned}
|\epsilon_\alpha(h) - \epsilon_T(h)| &\le \sum_{i=1}^{N} \alpha_i \big( |\epsilon_{S_i}(h) - \epsilon_{S_i}(h, h_i^*)| + |\epsilon_{S_i}(h, h_i^*) - \epsilon_T(h, h_i^*)| + |\epsilon_T(h, h_i^*) - \epsilon_T(h)| \big) \\
&\le \sum_{i=1}^{N} \alpha_i \big( \epsilon_{S_i}(h_i^*) + |\epsilon_{S_i}(h, h_i^*) - \epsilon_T(h, h_i^*)| + \epsilon_T(h_i^*) \big). \quad (25)
\end{aligned}$$

By the definition of $h_i^*$, $\epsilon_{S_i}(h_i^*) + \epsilon_T(h_i^*) = \lambda_i$ for $\lambda_i = \min_{h \in \mathcal{H}} \{ \epsilon_T(h) + \epsilon_{S_i}(h) \}$. Additionally, according to Lemma 1, for any $\epsilon > 0$, there exist an integer $n_\epsilon^i$ and a constant $a_{n_\epsilon^i}$ such that

$$|\epsilon_{S_i}(h, h_i^*) - \epsilon_T(h, h_i^*)| \le \frac{1}{2} a_{n_\epsilon^i} \sum_{k=1}^{n_\epsilon^i} d_{LM,k}(S_i, T) + \frac{\epsilon}{2}. \quad (26)$$

By applying these relations,

$$|\epsilon_\alpha(h) - \epsilon_T(h)| \le \sum_{i=1}^{N} \alpha_i \left( \lambda_i + \frac{1}{2} a_{n_\epsilon^i} \sum_{k=1}^{n_\epsilon^i} d_{LM,k}(S_i, T) + \frac{\epsilon}{2} \right) \le \sum_{i=1}^{N} \alpha_i \left( \lambda_i + \frac{1}{2} a_{n_\epsilon^i} \sum_{k=1}^{n_\epsilon^i} d_{LM,k}(S_i, T) \right) + \frac{\epsilon}{2}. \quad (27)$$

The last inequality holds by equation 27 with $h = h_T^*$.
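The label-wise moment distance $d_{LM,k}$ that drives these bounds compares class-weighted $k$-th moments between two labeled domains. A minimal empirical sketch (plain Python on 1-D labeled samples; `label_moment_gap` is a hypothetical helper of ours, not code from the paper) computes the sum over classes $c$ of $|p(c)\,\mathbb{E}_c[x^k] - p'(c)\,\mathbb{E}'_c[x^k]|$:

```python
def label_moment_gap(data, data_prime, k):
    """Empirical label-wise k-th moment gap between two labeled 1-D datasets:
    sum over classes c of |p(c) * E[x^k | c]  -  p'(c) * E'[x^k | c]|."""
    def stats(samples):
        n = len(samples)
        per_class = {}
        for x, y in samples:
            per_class.setdefault(y, []).append(x)
        # For each class: (empirical class probability, empirical k-th moment).
        return {c: (len(xs) / n, sum(v ** k for v in xs) / len(xs))
                for c, xs in per_class.items()}
    s, sp = stats(data), stats(data_prime)
    gap = 0.0
    for c in set(s) | set(sp):
        p, m = s.get(c, (0.0, 0.0))
        pp, mp = sp.get(c, (0.0, 0.0))
        gap += abs(p * m - pp * mp)
    return gap

# Identical labeled datasets have zero gap at every moment order.
d = [(0.1, 0), (0.4, 0), (0.9, 1), (1.2, 1)]
print(label_moment_gap(d, d, 1), label_moment_gap(d, d, 2))  # 0.0 0.0
```

Matching these per-class statistics, rather than whole-domain moments, is what lets the bound account for differences in the conditional distributions $p(x \mid y)$.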



Dataset sources:
[1] https://people.eecs.berkeley.edu/~jhoffman/domainadapt/
[2] https://github.com/KeiraZhao/MDAN/blob/master/amazon.npz
[3] http://yann.lecun.com/exdb/mnist/
[4] http://yaroslav.ganin.net
[5] http://ufldl.stanford.edu/housenumbers/
[6] http://yaroslav.ganin.net
[7] https://www.kaggle.com/bistaumanga/usps-dataset



Figure 1: MULTI-EPL for n = 2. MULTI-EPL consists of n pairs of a feature extractor and a label classifier, one domain classifier, and one final label classifier. Colors and symbols of the markers indicate domains and class labels of the data, respectively.
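The final prediction combines the outputs of the n extractor-classifier pairs shown in Figure 1. A toy sketch of such an ensemble (plain Python; averaging the class probabilities is our illustrative combination rule, not necessarily the paper's exact final label classifier):

```python
def ensemble_predict(prob_lists):
    """Average the class-probability vectors produced by n extractor-classifier
    pairs and return the argmax class. `prob_lists` holds n probability
    vectors over the same set of classes."""
    n = len(prob_lists)
    num_classes = len(prob_lists[0])
    avg = [sum(p[c] for p in prob_lists) / n for c in range(num_classes)]
    return max(range(num_classes), key=avg.__getitem__)

# Two pairs (n = 2) disagree; averaging their confidences picks class 1.
print(ensemble_predict([[0.6, 0.4], [0.1, 0.9]]))  # 1
```

The intuition is that each extractor captures a different view of the domains, so combining them is more robust than relying on a single feature extractor.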

Q1 Accuracy (Section 5.2). How well does MULTI-EPL perform in classification tasks?
Q2 Ablation Study (Section 5.3). How much does each component of MULTI-EPL contribute to the performance improvement?
Q3 Effects of the Degree of Ensemble (Section 5.4). How does the performance change as the number n of pairs of the feature extractor and the label classifier increases?

5.1 EXPERIMENTAL SETTINGS

Datasets. We use three kinds of datasets: Digits-Five, Office-Caltech10 [1], and Amazon Reviews [2]. Digits-Five consists of five datasets for digit recognition: MNIST [3] (LeCun et al. (1998)), MNIST-M [4] (Ganin & Lempitsky (2015)), SVHN [5] (Netzer et al. (2011)), SynthDigits [6] (Ganin & Lempitsky (2015)), and USPS [7] (Hastie et al. (2001)). We set one of them as the target domain and the rest as source domains. Following the conventions of prior works (Xu et al. (2018); Peng et al. (2019)), we randomly sample 25,000 instances from the source training set and 9,000 instances from the target training set to train the model, except for USPS, for which the whole training set is used.
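The sampling convention above, a fixed cap of instances per training set with USPS kept whole because it is smaller than the cap, can be sketched as follows (plain Python; `subsample` and the dataset sizes are illustrative assumptions of ours):

```python
import random

def subsample(dataset, limit, seed=0):
    """Randomly sample `limit` instances; keep the whole dataset when it is
    already smaller than the limit (as with USPS)."""
    if len(dataset) <= limit:
        return list(dataset)
    rng = random.Random(seed)  # fixed seed for a reproducible sample
    return rng.sample(dataset, limit)

# Hypothetical sizes: a 60000-instance source set and a 7291-instance USPS-like set.
source = list(range(60000))
usps = list(range(7291))
print(len(subsample(source, 25000)), len(subsample(usps, 25000)))  # 25000 7291
```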

We use three MSDA algorithms with state-of-the-art performance as baselines: DCTN (Xu et al. (2018)), M3SDA (Peng et al. (2019)), and M3SDA-β (Peng et al. (2019)). All the frameworks share the same architecture for the feature extractor, the domain classifier, and the label classifier for consistency. For Digits-Five, we use convolutional neural networks based on LeNet5 (LeCun et al. (1998)). For Office-Caltech10, we use ResNet50 (He et al. (2016)).

Under review as a conference paper at ICLR 2021

In equation 19, $p(c)$ and $p'(c)$ are the probabilities that an arbitrary data point is labeled as class $c$ in the domains $D$ and $D'$, respectively, and $\mu_c(x) = \mu(x \mid y = c)$ and $\mu'_c(x) = \mu'(x \mid y = c)$ are the data distributions given that the data is labeled as class $c$ on the domains $D$ and $D'$, respectively.

The bound in Theorem 1 holds with probability at least $1 - \delta$, where $\eta_{\alpha,\beta,m,\delta}$ is the sample-complexity term arising from Lemma 2 together with the uniform convergence bound, and $\lambda_i = \min_{h \in \mathcal{H}} \{ \epsilon_T(h) + \epsilon_{S_i}(h) \}$.

By Lemma 2 and the standard uniform convergence bound for hypothesis classes of finite VC dimension (Ben-David et al. (2010)), with probability at least $1 - \delta$,

$$\begin{aligned}
\epsilon_T(\hat{h}) &\le \epsilon_\alpha(\hat{h}) + \sum_{i=1}^{N} \alpha_i \left( \lambda_i + \frac{1}{2} a_{n_\epsilon^i} \sum_{k=1}^{n_\epsilon^i} d_{LM,k}(S_i, T) \right) + \frac{\epsilon}{2} \\
&\le \hat{\epsilon}_\alpha(\hat{h}) + \frac{\eta_{\alpha,\beta,m,\delta}}{2} + \sum_{i=1}^{N} \alpha_i \left( \lambda_i + \frac{1}{2} a_{n_\epsilon^i} \sum_{k=1}^{n_\epsilon^i} d_{LM,k}(S_i, T) \right) + \frac{\epsilon}{2} \\
&\le \hat{\epsilon}_\alpha(h_T^*) + \frac{\eta_{\alpha,\beta,m,\delta}}{2} + \sum_{i=1}^{N} \alpha_i \left( \lambda_i + \frac{1}{2} a_{n_\epsilon^i} \sum_{k=1}^{n_\epsilon^i} d_{LM,k}(S_i, T) \right) + \frac{\epsilon}{2} \\
&\le \epsilon_\alpha(h_T^*) + \eta_{\alpha,\beta,m,\delta} + \sum_{i=1}^{N} \alpha_i \left( \lambda_i + \frac{1}{2} a_{n_\epsilon^i} \sum_{k=1}^{n_\epsilon^i} d_{LM,k}(S_i, T) \right) + \frac{\epsilon}{2} \\
&\le \epsilon_T(h_T^*) + \eta_{\alpha,\beta,m,\delta} + \epsilon + \sum_{i=1}^{N} \alpha_i \left( 2\lambda_i + a_{n_\epsilon^i} \sum_{k=1}^{n_\epsilon^i} d_{LM,k}(S_i, T) \right),
\end{aligned}$$

where the first inequality applies equation 27 with $h = \hat{h}$, the second and fourth apply the uniform convergence bound, and the third follows since $\hat{h}$ minimizes $\hat{\epsilon}_\alpha$.

Summary of datasets. Office-Caltech10 is the dataset for image classification with the 10 categories that the Office31 dataset and the Caltech dataset have in common. It involves four different domains: Amazon, Caltech, DSLR, and Webcam. We double the number of data points by data augmentation, and use all the original data and the augmented data as training data and test data, respectively. The Amazon Reviews dataset contains customers' reviews on 4 product categories: Books, DVDs, Electronics, and Kitchen appliances. The instances are encoded into 5000-dimensional vectors and labeled as either positive or negative depending on their sentiments. We set each of the four categories as the target and the rest as sources. For all the domains, 2000 instances are sampled for training, and the rest of the data are used for the test. Details about the datasets are summarized in Table 1.
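The per-domain split described above (2000 sampled training instances, the remainder for testing) can be sketched as follows (plain Python; `train_test_split_fixed` and the 5000-review domain size are illustrative assumptions of ours):

```python
import random

def train_test_split_fixed(instances, n_train=2000, seed=0):
    """Randomly take a fixed number of training instances per domain
    (2000 for Amazon Reviews) and use the remaining instances as the test set."""
    rng = random.Random(seed)  # fixed seed for a reproducible split
    shuffled = list(instances)
    rng.shuffle(shuffled)
    return shuffled[:n_train], shuffled[n_train:]

# Hypothetical domain with 5000 encoded reviews.
train, test = train_test_split_fixed(list(range(5000)))
print(len(train), len(test))  # 2000 3000
```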

Classification accuracy on Digits-Five, Office-Caltech10, and Amazon Reviews with and without domain adaptation. The letters before and after the slash represent the source domains and the target domain, respectively. In Digits-Five, T, M, S, D, and U stand for MNIST, MNIST-M, SVHN, SynthDigits, and USPS, respectively. In Office-Caltech10 and Amazon Reviews, we indicate each domain using the first letter of its name. SC and SB indicate Source Combined and Single Best, respectively. Note that MULTI-EPL shows the best performance.

Experiments with MULTI-EPL and its variants.

