A SIMPLE UNIFIED INFORMATION REGULARIZATION FRAMEWORK FOR MULTI-SOURCE DOMAIN ADAPTATION

Abstract

The adversarial learning strategy has demonstrated remarkable performance in dealing with single-source unsupervised Domain Adaptation (DA) problems, and it has recently been applied to multi-source DA problems. Although most existing DA methods use multiple domain discriminators, the effect of using multiple discriminators on the quality of latent space representations has been poorly understood. Here we provide theoretical insights into potential pitfalls of using multiple domain discriminators: First, domain-discriminative information is inevitably distributed across multiple discriminators. Second, it is not scalable in terms of computational resources. Third, the variance of stochastic gradients from multiple discriminators may increase, which significantly undermines training stability. To address these issues, we situate adversarial DA in the context of information regularization. First, we present a unified information regularization framework for multi-source DA. It provides a theoretical justification for using a single, unified domain discriminator to encourage the synergistic integration of the information gleaned from each domain. Second, this motivates us to implement a novel neural architecture called Multi-source Information-regularized Adaptation Networks (MIAN). The proposed model significantly reduces the variance of stochastic gradients and increases computational efficiency. Large-scale simulations on various multi-source DA scenarios demonstrate that MIAN, despite its structural simplicity, reliably outperforms other state-of-the-art methods by a large margin, especially for difficult target domains.

1. INTRODUCTION

Although a large number of studies have demonstrated the ability of deep neural networks to solve challenging tasks, the tasks solved by such networks are mostly confined to a similar type or a single domain. One remaining challenge is the problem known as domain shift (Gretton et al. (2009)), where a direct transfer of information gleaned from a single source domain to unseen target domains may lead to significant performance impairment. Domain adaptation (DA) approaches aim to mitigate this problem by learning to map data from both domains onto a common feature space. Whereas several theoretical results (Ben-David et al. (2007); Blitzer et al. (2008); Zhao et al. (2019a)) and algorithms for DA (Long et al. (2015; 2017); Ganin et al. (2016)) have focused on the case in which only a single source domain dataset is given, we consider a more challenging and generalized problem of knowledge transfer, referred to as Multi-source unsupervised DA (MDA). Following seminal theoretical results on MDA (Blitzer et al. (2008); Ben-David et al. (2010)), technical advances have been made, mainly on adversarial methods (Xu et al. (2018); Zhao et al. (2019c)). While most adversarial MDA methods use multiple independent domain discriminators (Xu et al. (2018); Zhao et al. (2018); Li et al. (2018); Zhao et al. (2019c;b)), the potential pitfalls of this setting have not been fully explored. The existing works do not provide a theoretical guarantee that unnecessary domain-specific information is fully filtered out, because the domain-discriminative information is inevitably distributed across multiple discriminators. For example, multiple domain discriminators focus only on estimating the domain shift between each source domain and the target, while the discrepancies between the source domains are neglected, making it hard to align all the given domains. This necessitates garnering the domain-discriminative information with a unified discriminator.
Moreover, the multiple-domain-discriminator setting is not scalable in terms of computational resources, especially when a large number of source domains is given, e.g., medical reports from multiple patients. Finally, it may undermine the stability of training, as earlier works solve multiple independent adversarial minimax problems. To overcome these limitations, we propose a novel MDA method, called Multi-source Information-regularized Adaptation Networks (MIAN), which constrains the mutual information between latent representations and domain labels. First, we show that such mutual information regularization is closely related to the explicit optimization of the H-divergence between the source and target domains. This affords the theoretical insight that conventional adversarial DA can be translated into an information-theoretic regularization problem. Second, based on our findings, we propose a new optimization problem for MDA: minimizing the adversarial loss over multiple domains with a single domain discriminator. We show that the domain shift between each pair of source domains can be indirectly penalized with a single domain discriminator, which is known to be beneficial in MDA (Li et al. (2018); Peng et al. (2019)). Moreover, by analyzing existing studies in terms of information regularization, we found that the variance of the stochastic gradients increases when using multiple discriminators. Despite its structural simplicity, we found that MIAN works efficiently across a wide variety of MDA scenarios, including the Digits-Five (Peng et al. (2019)), Office-31 (Saenko et al. (2010)), and Office-Home (Venkateswara et al. (2017)) datasets. Intriguingly, MIAN reliably and significantly outperformed several state-of-the-art methods that either employ a domain discriminator separately for each source domain (Xu et al. (2018)) or align the moments of the deep feature distributions for every pairwise domain (Peng et al. (2019)).

2. RELATED WORKS

Several DA methods have been used in an attempt to learn domain-invariant representations. Along with the increasing use of deep neural networks, contemporary work focuses on matching deep latent representations from the source domain with those from the target domain. Several measures have been introduced to handle domain shift, such as maximum mean discrepancy (MMD) (Long et al. (2014; 2015)), correlation distance (Sun et al. (2016); Sun & Saenko (2016)), and Wasserstein distance (Courty et al. (2017)). Recently, adversarial DA methods (Ganin et al. (2016); Tzeng et al. (2017); Hoffman et al. (2017); Saito et al. (2018; 2017)) have become mainstream approaches owing to the development of generative adversarial networks (Goodfellow et al. (2014)). However, the abovementioned single-source DA approaches inevitably sacrifice performance when applied to multi-source DA. Some MDA studies (Blitzer et al. (2008); Ben-David et al. (2010); Mansour et al. (2009); Hoffman et al. (2018)) have provided the theoretical background for algorithm-level solutions. Blitzer et al. (2008) and Ben-David et al. (2010) explore the extended upper bound of the true risk on unlabeled samples from the target domain with respect to a weighted combination of multiple source domains. Following these theoretical studies, MDA methods with shallow models (Duan et al. (2012b;a); Chattopadhyay et al. (2012)) as well as with deep neural networks (Mancini et al. (2018); Peng et al. (2019); Li et al. (2018)) have been proposed. Recently, several adversarial MDA methods have also been proposed. Xu et al. (2018) implemented a k-way domain discriminator and classifier to battle both domain and category shifts. Zhao et al. (2018) also used multiple discriminators to optimize the average-case generalization bounds. Zhao et al. (2019c) chose relevant source training samples for DA by minimizing the empirical Wasserstein distance between the source and target domains.
Instead of using separate encoders, domain discriminators, or classifiers for each source domain as in earlier works, our approach uses unified networks, thereby improving resource-efficiency and scalability. Several existing MDA works have proposed methods to estimate the source domain weights following Blitzer et al. (2008) and Ben-David et al. (2010). Mansour et al. (2009) assumed that the target hypothesis can be approximated by a convex combination of the source hypotheses. (Peng et al. (2019); Zhao et al. (2018)) suggested ad-hoc schemes for domain weights based on the empirical risk of each source domain. Li et al. (2018) computed a softmax-transformed weight vector using an empirical Wasserstein-like measure instead of the empirical risks. Compared to these methods, which lack robust theoretical justifications, our analysis does not require any assumption about, or estimation of, the domain coefficients. In our framework, the representations are distilled to be independent of the domain, thereby rendering the performance relatively insensitive to explicit weighting strategies.

3. PRELIMINARIES

Suppose that N source domains D_{S_i} (i = 1, ..., N) and a target domain D_T are given, with samples X_{S_i} and X_T drawn from each. The domain label and its probability distribution are denoted by V and P_V(v), where v ∈ V and V is the set of domain labels. In line with prior works (Hoffman et al. (2012); Gong et al. (2013); Mancini et al. (2018); Gong et al. (2019)), the domain label can generally be treated as a stochastic latent random variable in our framework. However, for simplicity, we take the empirical version of the true distributions with the given samples, assuming that the domain labels of all samples are known. The latent representation of a sample is given by Z, and the encoder is defined as F : X → Z, with X and Z representing the data space and latent space, respectively. Accordingly, Z_{S_i} and Z_T refer to the outputs of the encoder, F(X_{S_i}) and F(X_T), respectively. For notational simplicity, we omit the index i from D_{S_i}, X_{S_i} and Z_{S_i} when N = 1.
A classifier is defined as C : Z → Y where Y is the class label space.

3.1. PROBLEM FORMULATION

For comparison with our formulation, we first recast single-source DA as a constrained optimization problem. The true risk $\epsilon_T(h)$ on unlabeled samples from the target domain is bounded above by the sum of three terms (Ben-David et al. (2010)): (1) the true risk $\epsilon_S(h)$ of hypothesis h on the source domain; (2) the H-divergence $d_{\mathcal{H}}(\mathcal{D}_S, \mathcal{D}_T)$ between the source and target domain distributions; and (3) the optimal joint risk $\lambda^*$.

Theorem 1 (Ben-David et al. (2010)). Let the hypothesis class $\mathcal{H}$ be a set of binary classifiers $h : \mathcal{X} \rightarrow \{0, 1\}$. Then for the given domain distributions $\mathcal{D}_S$ and $\mathcal{D}_T$, $\forall h \in \mathcal{H}$,

$$\epsilon_T(h) \le \epsilon_S(h) + d_{\mathcal{H}}(\mathcal{D}_S, \mathcal{D}_T) + \lambda^*,$$

where $d_{\mathcal{H}}(\mathcal{D}_S, \mathcal{D}_T) = 2\sup_{h\in\mathcal{H}} \left| \mathbb{E}_{x\sim\mathcal{D}_S^X} I(h(x)=1) - \mathbb{E}_{x\sim\mathcal{D}_T^X} I(h(x)=1) \right|$ and $I(a)$ is an indicator function whose value is 1 if $a$ is true, and 0 otherwise.

The empirical H-divergence $\hat{d}_{\mathcal{H}}(X_S, X_T)$ can be computed as follows (Ben-David et al. (2010)):

Lemma 1.
$$\hat{d}_{\mathcal{H}}(X_S, X_T) = 2\left(1 - \min_{h\in\mathcal{H}}\left[\frac{1}{m}\sum_{x\in X_S} I[h(x)=1] + \frac{1}{m}\sum_{x\in X_T} I[h(x)=0]\right]\right)$$

Following Lemma 1, a domain classifier $h : \mathcal{Z} \rightarrow \mathcal{V}$ can be used to compute the empirical H-divergence. Suppose the optimal joint risk $\lambda^*$ is sufficiently small, as assumed in most adversarial DA studies (Saito et al. (2017); Chen et al. (2019)). Then one can obtain the ideal encoder and classifier minimizing the upper bound of $\epsilon_T(h)$ by solving the following min-max problem:

$$F^*, C^* = \arg\min_{F,C} L(F, C) + \beta\, \hat{d}_{\mathcal{H}}(Z_S, Z_T) = \arg\min_{F,C}\max_{h\in\mathcal{H}} L(F, C) + \beta\,\frac{1}{m}\left[\sum_{i: z_i\in Z_S} I[h(z_i)=1] + \sum_{j: z_j\in Z_T} I[h(z_j)=0]\right],$$

where $L(F, C)$ is the loss function on samples from the source domain, $\beta$ is a Lagrangian multiplier, $\mathcal{V} = \{0, 1\}$ such that each source instance and target instance are labeled as 1 and 0, respectively, and $h$ is the binary domain classifier. Note that the latter min-max problem is obtained by converting $-\min$ into $\max$ and removing the constant terms from Lemma 1.
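The quantity in Lemma 1 can be sketched in a few lines of numpy. The helper below is our own illustration, not part of the proposed method: the min over the hypothesis class is replaced by a fixed classifier h (a hypothetical threshold rule), and equal sample sizes m are assumed for both domains.

```python
import numpy as np

def empirical_h_divergence(h, X_S, X_T):
    # Lemma 1 with the min over H approximated by a fixed classifier h:
    # d_hat = 2 * (1 - [(1/m) sum_S I[h(x)=1] + (1/m) sum_T I[h(x)=0]])
    m = len(X_S)  # assumes |X_S| = |X_T| = m
    mixed = (np.sum(h(X_S) == 1) + np.sum(h(X_T) == 0)) / m
    return 2.0 * (1.0 - mixed)

rng = np.random.default_rng(0)
X_S = rng.normal(-3.0, 0.5, 100)            # well-separated toy domains
X_T = rng.normal(+3.0, 0.5, 100)
h_threshold = lambda x: (x > 0).astype(int)  # hypothetical domain classifier
```

For perfectly separable domains, the classifier attaining the min labels all source samples 0 and all target samples 1, so the bracketed term vanishes and the divergence estimate reaches its maximum of 2; for identical samples a constant classifier drives the estimate to 0.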

3.2. INFORMATION-REGULARIZED MIN-MAX PROBLEM FOR MDA

Intuitively, it is not desirable to adapt the representation learned in a given domain to the other domains, particularly when the representation itself is not sufficiently domain-independent. This motivates us to explore ways to learn representations independent of domains. Inspired by a contemporary study on fair model training (Roh et al. (2020)), the mutual information between the latent representation and the domain label, $I(Z; V)$, can be expressed as follows:

Theorem 2. Let $P_Z(z)$ be the distribution of $Z$ where $z \in \mathcal{Z}$. Let $h$ be a domain classifier $h : \mathcal{Z} \rightarrow \mathcal{V}$, where $\mathcal{Z}$ is the feature space and $\mathcal{V}$ is the set of domain labels. Let $h_v(z)$ be the conditional probability of $V = v$, where $v \in \mathcal{V}$, given $Z = z$, defined by $h$. Then the following holds:

$$I(Z; V) = \max_{h_v(z):\, \sum_{v\in\mathcal{V}} h_v(z)=1,\, \forall z}\; \sum_{v\in\mathcal{V}} P_V(v)\,\mathbb{E}_{z\sim P_{Z|v}}\left[\log h_v(z)\right] + H(V)$$

The detailed proof is provided in Roh et al. (2020) and the Supplementary Material. As in Roh et al. (2020), we can derive the empirical version of Theorem 2 as follows:

$$\hat{I}(Z; V) = \max_{h_v(z):\, \sum_{v\in\mathcal{V}} h_v(z)=1,\, \forall z}\; \frac{1}{M}\sum_{v\in\mathcal{V}}\sum_{i: v_i=v} \log h_{v_i}(z_i) + H(V), \tag{5}$$

where $M$ is the total number of representation samples, $i$ is the sample index, and $v_i$ is the domain label of the $i$th sample. Using this equation, we combine our information-constrained objective function with the result of Lemma 1. For binary classification, $\mathcal{V} = \{0, 1\}$, with $Z_S$ and $Z_T$ of equal size $M/2$, we propose the following information-regularized minimax problem:

$$F^*, C^* = \arg\min_{F,C} L(F, C) + \beta\,\hat{I}(Z; V) = \arg\min_{F,C}\max_{h\in\mathcal{H}} L(F, C) + \beta\,\frac{1}{M}\left[\sum_{i: z_i\in Z_S}\log h(z_i) + \sum_{j: z_j\in Z_T}\log\left(1 - h(z_j)\right)\right], \tag{6}$$

where $\beta$ is a Lagrangian multiplier, $h(z_i) \triangleq h_{v_i=1}(z_i)$ and $1 - h(z_i) \triangleq h_{v_i=0}(z_i)$, with $h(z_i)$ representing the probability that $z_i$ belongs to the source domain. This parameterization automatically satisfies the constraint $\sum_{v\in\mathcal{V}} h_v(z) = 1, \forall z$, which can therefore be dropped. Note that we consider the simple situation in which the entropy $H(V)$ remains constant.
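Theorem 2 can be checked numerically on a small discrete example: the variational objective attains $I(Z;V)$ exactly at the optimal classifier $h_v(z) = P(v \mid z)$ and falls below it for any other classifier. The sketch below is our own illustrative check, not part of the model:

```python
import numpy as np

rng = np.random.default_rng(0)
P = rng.random((5, 3))            # joint P(z, v): 5 values of z, 3 domain labels
P /= P.sum()
Pz = P.sum(axis=1, keepdims=True)
Pv = P.sum(axis=0)
H_V = -(Pv * np.log(Pv)).sum()    # entropy H(V)

def variational_objective(h):
    # sum_v P(v) E_{z~P(Z|v)}[log h_v(z)] + H(V), with h[z, v] = h_v(z)
    return (P * np.log(h)).sum() + H_V

I_zv = (P * np.log(P / (Pz * Pv))).sum()  # true mutual information I(Z; V)
h_opt = P / Pz                            # optimal domain classifier P(v | z)
h_uniform = np.full_like(P, 1.0 / 3.0)    # a suboptimal classifier
```

Evaluating `variational_objective` at `h_opt` recovers `I_zv`, while the uniform classifier gives a strictly smaller value, matching the maximization in Theorem 2.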

3.3. ADVANTAGES OVER OTHER MDA METHODS

The relationship between (3) and (6) provides a theoretical insight: minimizing the mutual information between the latent representation and the domain label is closely related to minimizing the H-divergence under the adversarial learning scheme. This relationship clearly underlines the significance of information regularization for MDA. Compared to the existing MDA approaches (Xu et al. (2018); Zhao et al. (2018)), which inevitably distribute domain-discriminative knowledge over N different domain classifiers, the objective function (6) enables us to seamlessly integrate such information with the single domain classifier h. Using a single domain discriminator also helps reduce the variance of the gradients. Large variances in the stochastic gradients slow down convergence, which leads to poor performance (Johnson & Zhang (2013)). Herein, we analyze the variances of the stochastic gradients of existing optimization constraints. By excluding the weighted source combination strategy, we can simplify the optimization constraint of existing adversarial MDA methods as the sum of the information constraints:

$$\sum_{k=1}^{N} I(Z_k; U_k) = \sum_{k=1}^{N}\max_{h^k_u(z):\, \sum_{u\in\mathcal{U}} h^k_u(z)=1,\, \forall z}\; \sum_{u\in\mathcal{U}} P_{U_k}(u)\,\mathbb{E}_{z_k\sim P_{Z_k|u}}\left[\log h^k_u(z_k)\right] + \sum_{k=1}^{N} H(U_k), \tag{7}$$

where $U_k$ is the $k$th domain label with $\mathcal{U} = \{0, 1\}$; $P_{Z_k|u=0}(\cdot) = P_{Z|v=N+1}(\cdot)$ corresponds to the target domain; $P_{Z_k|u=1}(\cdot) = P_{Z|v=k}(\cdot)$ corresponds to the $k$th source domain; and $h^k_u(z_k)$ is the conditional probability of $u \in \mathcal{U}$ given $z_k$, defined by the $k$th discriminator, indicating whether the sample is generated from the $k$th source domain. Again, we treat the entropy $H(U_k)$ as a constant. Note that the interaction information cannot be measured with (7). Given $M = m(N+1)$ samples, with $m$ representing the number of samples per domain, an empirical version of (7) is:

$$\sum_{k=1}^{N}\hat{I}(Z_k; U_k) = \frac{1}{M}\sum_{k=1}^{N}\max_{h^k_u(z):\, \sum_{u\in\mathcal{U}} h^k_u(z)=1,\, \forall z}\; \sum_{u\in\mathcal{U}}\sum_{i: u_i=u}\log h^k_u(z_i^k) + \sum_{k=1}^{N} H(U_k). \tag{8}$$
Let $I_k$ be shorthand for the $k$th term inside the first summation in (8). Without loss of generality, we make the simplifying assumptions that $\mathrm{Var}[I_k]$ is the same for all $k$, and that $\mathrm{Cov}[I_k, I_j]$ is the same for all pairs. Then the variance of (8) is given by:

$$\mathrm{Var}\left[\sum_{k=1}^{N}\hat{I}(Z_k; U_k)\right] = \frac{1}{M^2}\left(\sum_{k=1}^{N}\mathrm{Var}[I_k] + 2\sum_{k=1}^{N}\sum_{j>k}\mathrm{Cov}[I_k, I_j]\right) = \frac{1}{m^2}\left(\frac{N}{(N+1)^2}\mathrm{Var}[I_k] + \frac{N(N-1)}{(N+1)^2}\mathrm{Cov}[I_k, I_j]\right). \tag{9}$$

As earlier works solve $N$ adversarial minimax problems, the covariance term is additionally included, and its contribution to the variance does not decrease with increasing $N$. In other words, the covariance term may dominate the variance of the gradients as the number of domains increases. In contrast, the variance of our constraint (5) is inversely proportional to $(N+1)^2$. Let $I_m$ be shorthand for the maximization term in (5), excluding the factor $1/M$. Then the variance of (5) is given by:

$$\mathrm{Var}\left[\hat{I}(Z; V)\right] = \frac{1}{m^2(N+1)^2}\mathrm{Var}[I_m].$$

This implies that our framework can significantly improve the stability of stochastic gradient optimization compared with existing approaches, especially when the model is expected to learn from many domains.
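The two variance expressions above can be illustrated with a small Monte-Carlo sketch of our own (the variance and covariance values are arbitrary toy choices): the covariance term keeps the multi-discriminator variance from vanishing as N grows, while the single-discriminator variance decays with $(N+1)^2$.

```python
import numpy as np

rng = np.random.default_rng(0)
VAR, COV, M_PER_DOMAIN, TRIALS = 1.0, 0.5, 1.0, 20000  # assumed toy values

def variance_multi(N):
    # Empirical Var of the sum of N correlated per-discriminator terms I_k,
    # scaled by 1 / (m^2 (N+1)^2) as in the multi-discriminator expression
    Sigma = np.full((N, N), COV) + np.eye(N) * (VAR - COV)
    I = rng.multivariate_normal(np.zeros(N), Sigma, size=TRIALS)
    return I.sum(axis=1).var() / (M_PER_DOMAIN * (N + 1)) ** 2

def variance_single(N):
    # Empirical Var of the single unified-discriminator term I_m,
    # scaled by the same 1 / (m^2 (N+1)^2) factor
    I = rng.normal(0.0, np.sqrt(VAR), size=TRIALS)
    return I.var() / (M_PER_DOMAIN * (N + 1)) ** 2
```

Running both for increasing N shows the single-discriminator variance staying strictly below the multi-discriminator one, consistent with the analysis.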

3.4. SITUATING DOMAIN ADAPTATION IN CONTEXT OF INFORMATION BOTTLENECK THEORY

In this Section, we bridge the gap between existing adversarial DA methods and the information bottleneck (IB) theory (Tishby et al. (2000); Tishby & Zaslavsky (2015); Alemi et al. (2016)). Tishby et al. (2000) examined the problem of learning an encoding $Z$ that is maximally informative about the class $Y$ while being minimally informative about the sample $X$:

$$\min_{P_{enc}(z|x)} \beta I(Z; X) - I(Z; Y),$$

where $\beta$ is a Lagrangian multiplier. Indeed, the role of the bottleneck term $I(Z; X)$ matches our mutual information $I(Z; V)$ between the latent representation and the domain label. We foster close collaboration between the two information bottleneck terms by incorporating them into $I(Z; X, V)$.

Theorem 3. Let $P_{Z|x,v}(z)$ be the conditional probability distribution of $Z$, where $z \in \mathcal{Z}$, defined by the encoder $F$ given a sample $x \in \mathcal{X}$ and the domain label $v \in \mathcal{V}$. Let $R_Z(z)$ denote a prior marginal distribution of $Z$. Then the following inequality holds:

$$I(Z; X, V) \le \max_{h_v(z):\, \sum_{v\in\mathcal{V}} h_v(z)=1,\, \forall z}\; \sum_{v\in\mathcal{V}} P_V(v)\,\mathbb{E}_{z\sim P_{Z|v}}\left[\log h_v(z)\right] + H(V) + \mathbb{E}_{(x,v)\sim P_{X,V}} D_{KL}\left[P_{Z|x,v}\,\|\,R_Z\right]$$

The proof of Theorem 3 uses the chain rule $I(Z; X, V) = I(Z; V) + I(Z; X \mid V)$; the detailed proof is provided in the Supplementary Material. Whereas the role of $I(Z; X \mid V)$ is to purify the latent representation generated from the given domain, $I(Z; V)$ serves as a proxy regularizer that aligns the purified representations across different domains. Thus, the existing DA approaches (Luo et al. (2019); Song et al. (2019)) using the variational information bottleneck (Alemi et al. (2016)) can be viewed as special cases of Theorem 3 with a single source domain.
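The chain rule $I(Z; X, V) = I(Z; V) + I(Z; X \mid V)$ underlying Theorem 3 can be verified on a toy discrete joint distribution; the check below is purely illustrative and uses only entropy identities:

```python
import numpy as np

rng = np.random.default_rng(1)
P = rng.random((3, 4, 2))   # joint P(z, x, v) over small discrete spaces
P /= P.sum()

def entropy(p):
    p = p.ravel()
    p = p[p > 0]
    return -(p * np.log(p)).sum()

Pz, Pv = P.sum(axis=(1, 2)), P.sum(axis=(0, 1))
Pzv, Pxv = P.sum(axis=1), P.sum(axis=0)

# I(Z; X, V) = H(Z) + H(X, V) - H(Z, X, V)
I_z_xv = entropy(Pz) + entropy(Pxv) - entropy(P)
# I(Z; V) = H(Z) + H(V) - H(Z, V)
I_z_v = entropy(Pz) + entropy(Pv) - entropy(Pzv)
# I(Z; X | V) = H(Z, V) + H(X, V) - H(V) - H(Z, X, V)
I_z_x_given_v = entropy(Pzv) + entropy(Pxv) - entropy(Pv) - entropy(P)
```

The two right-hand terms sum exactly to the left-hand mutual information, which is the decomposition the proof of Theorem 3 bounds term by term.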

4. MULTI-SOURCE INFORMATION-REGULARIZED ADAPTATION NETWORKS

In this Section, we provide the details of our proposed architecture, referred to as a Multi-source Information-regularized Adaptation Network (MIAN). MIAN addresses the information-constrained min-max problem for MDA (Section 3.2) using the three subcomponents depicted in Figure 1: information regularization, source classification, and Decaying Batch Spectral Penalization (DBSP).

Information regularization. To estimate the empirical mutual information $\hat{I}(Z; V)$ in (5), the domain classifier $h$ should be trained to minimize the softmax cross-entropy. Let $\mathcal{V} = \{1, 2, ..., N+1\}$ and denote $h(z)$ as the $(N+1)$-dimensional vector of conditional probabilities for each domain given the sample $z$. Let $\mathbf{1}$ be the $(N+1)$-dimensional vector of all ones, and $\mathbf{1}_{[k=v]}$ the $(N+1)$-dimensional vector whose $v$th entry is 1 and whose other entries are 0. Given $M = m(N+1)$ samples, the objective is:

$$\min_h\; -\frac{1}{M}\sum_{v\in\mathcal{V}}\sum_{i: v_i=v}\mathbf{1}_{[k=v_i]}^T \log h(z_i). \tag{13}$$

In this study, we slightly modify the objective (13). Specifically, we explicitly minimize the conditional probabilities of the remaining domains except the $v_i$th domain. Let $\mathbf{1}_{[k\ne v]}$ be the flipped version of $\mathbf{1}_{[k=v]}$. Then the objective function for the domain discriminator is:

$$\min_h\; -\frac{1}{M}\sum_{v\in\mathcal{V}}\sum_{i: v_i=v}\left[\mathbf{1}_{[k=v_i]}^T \log h(z_i) + \mathbf{1}_{[k\ne v_i]}^T \log\left(\mathbf{1} - h(z_i)\right)\right], \tag{14}$$

and the encoder is trained to maximize (14). Our objective function is also closely related to that of GAN (Goodfellow et al. (2014)), and we experimentally found that using the variant GAN objective of Mao et al. (2017) works slightly better. The above objective is closely related to optimizing every pairwise domain discrepancy between each given domain and the mixture of the others. Let $\mathcal{D}_v$ and $\mathcal{D}_{v^c}$ denote the $v$th domain and the mixture of the remaining $N$ domains with uniform mixture weight $\frac{1}{N}$, respectively. Then we can define the H-divergence $d_{\mathcal{H}}(\mathcal{D}_v, \mathcal{D}_{v^c})$, and the average of such H-divergences over all $v$ as $d_{\mathcal{H}}(\mathcal{V})$.
Assume that samples $Z_v$ and $Z_{v^c}$ of size $m$ are generated from $\mathcal{D}_v$ and $\mathcal{D}_{v^c}$, respectively, where $Z_{v^c} = \bigcup_{v'\ne v} Z_{v'}$ with $|Z_{v'}| = m/N$ for all $v' \in \mathcal{V}$. Thus the domain label $v_j \ne v$ for every $j$th sample in $Z_{v^c}$. Then the average empirical H-divergence $\hat{d}_{\mathcal{H}}(\mathcal{V})$ is defined as follows:

$$\hat{d}_{\mathcal{H}}(\mathcal{V}) = \frac{1}{N+1}\sum_{v\in\mathcal{V}}\hat{d}_{\mathcal{H}}(Z_v, Z_{v^c}) = \frac{1}{N+1}\sum_{v\in\mathcal{V}} 2\left(1 - \min_{h\in\mathcal{H}}\left[\frac{1}{m}\sum_{i: v_i=v} I[h_v(z_i)=1] + \frac{1}{m}\sum_{j: v_j\ne v} I[h_v(z_j)=0]\right]\right), \tag{15}$$

where $h_v(z)$ represents the $v$th entry of $h(z)$. Note that $h(z)$ corresponds to an $(N+1)$-dimensional one-hot classification vector in (15), unlike in (14). Then, let $\mathbb{I}[h(z)] := \left(I(h_v(z)=1)\right)_{v\in\mathcal{V}}$ be the $(N+1)$-dimensional one-hot indicator vector. Given the unified domain discriminator $h$ in the inner minimization for every $v$ in (15), we train $h$ to approximate the lower bound of $\hat{d}_{\mathcal{H}}(\mathcal{V})$ as follows:

$$h^* = \arg\max_{h\in\mathcal{H}}\frac{1}{M}\sum_{v\in\mathcal{V}}\left[\sum_{i: v_i=v} I[h_v(z_i)=1] + \sum_{j: v_j\ne v} I[h_v(z_j)=0]\right] = \arg\min_{h\in\mathcal{H}} -\frac{1}{M}\sum_{v\in\mathcal{V}}\sum_{i: v_i=v}\left[\mathbf{1}_{[k=v_i]}^T\,\mathbb{I}[h(z_i)] + \mathbf{1}_{[k\ne v_i]}^T\left(\mathbf{1} - \mathbb{I}[h(z_i)]\right)\right], \tag{16}$$

where the latter equality is obtained by rearranging the summation terms in the first equality. Based on the close relationship between (14) and (16), we can link information regularization and H-divergence optimization in the multi-source setting: minimizing $\hat{d}_{\mathcal{H}}(\mathcal{V})$ is closely related to implicit regularization of the mutual information between latent representations and domain labels. Because the output vector $\mathbb{I}[h(z)]$ in (15) often comes from an argmax operation, (15) is not differentiable w.r.t. $z$; in contrast, our objective (14) is differentiable. There are two benefits of minimizing $d_{\mathcal{H}}(\mathcal{V})$. First, it includes the H-divergence between the target and a mixture of sources, which directly affects the upper bound of the empirical risk on target samples (Theorem 5 in Ben-David et al. (2010)). Second, $d_{\mathcal{H}}(\mathcal{V})$ lower-bounds the average of all pairwise H-divergences between domains. The detailed proof is provided in the appendix (Lemma 2).
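The discriminator objective (14) can be sketched as a short numpy function. The function name and array shapes below are our own illustrative choices; in practice $h$ would be the softmax output of the unified discriminator network over the N+1 domains.

```python
import numpy as np

def discriminator_loss(h, labels):
    """Eq. (14)-style loss: h is (M, N+1) softmax domain probabilities,
    labels is (M,) with the domain index v_i of each sample."""
    M, K = h.shape
    eps = 1e-12                     # numerical guard for the logarithms
    onehot = np.eye(K)[labels]      # 1_[k = v_i]
    flipped = 1.0 - onehot          # 1_[k != v_i]
    ce = -(onehot * np.log(h + eps)).sum()           # standard cross-entropy term
    comp = -(flipped * np.log(1.0 - h + eps)).sum()  # complementary term on other domains
    return (ce + comp) / M
```

The discriminator minimizes this quantity while the encoder is trained to maximize it, yielding the adversarial game described above; a confidently correct discriminator drives the loss toward zero.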
Note that unlike our single-domain-classifier setting, existing methods (Li et al. (2018)) require about O(N^2) domain classifiers to approximate all pairwise combinations of domain discrepancy.

Source classification. Along with learning the domain-independent latent representations illustrated above, we train the classifier on the labeled source domain datasets so that it can be directly applied to the target domain representations in practice. To minimize the empirical risk on the source domains, we use a generic softmax cross-entropy loss on labeled source domain samples as L(F, C).

Decaying batch spectral penalization. Applying the above information-theoretic insights, we further describe a potential side effect of existing adversarial DA methods. Information regularization may entail implicit entropy minimization, particularly in the early stages of training, impairing the richness of the latent feature representations. To prevent such a pathological phenomenon, we introduce a new technique called Decaying Batch Spectral Penalization (DBSP), which controls the SVD entropy of the feature space. Our version improves training efficiency compared to the original Batch Spectral Penalization (Chen et al. (2019)). We refer to this version of our model as MIAN-γ. As vanilla MIAN is sufficient to outperform other state-of-the-art methods (Section 5), MIAN-γ is further discussed in the Supplementary Material.

To assess the performance of MIAN, we ran large-scale simulations using the following benchmark datasets: Digits-Five, Office-31 and Office-Home. For a fair comparison, we reproduced all the other baseline results using the same backbone architecture and optimizer settings as the proposed method.

5. EXPERIMENTS

5.1. RESULTS

For comparison with the source-only and single-source DA standards, we also evaluate existing MDA approaches (e.g., Xu et al. (2018)). The classification accuracies for Digits-Five, Office-31, and Office-Home are summarized in Tables 1, 2, and 3, respectively. We found that MIAN outperforms most other state-of-the-art single-source and multi-source DA methods by a large margin. Note that our method demonstrated significant improvements in challenging domains such as MNIST-M, Amazon, and Clipart.

5.2. QUALITATIVE AND QUANTITATIVE ANALYSES

Design of domain discriminator. To quantify the extent to which performance improvement is achieved by unifying the domain discriminators, we compared the performance of four different versions of MIAN (Figures 2a, 2b). No S-S align is the same as MIAN except that only the target and each source domain are aligned. No LS uses the objective function in (14) rather than the least-squares variant of Mao et al. (2017). Multi D employs as many discriminators as the number of source domains, analogous to the existing approaches. For a fair comparison, all other experimental settings are fixed. The results illustrate that all versions with the unified discriminator reliably outperform Multi D in terms of both accuracy and reliability. This suggests that unification of the domain discriminators can substantially improve task performance.

Variance of stochastic gradients. With respect to the above analysis, we compared the variance of the stochastic gradients computed with the different available domain discriminators. We trained MIAN and Multi D using mini-batches of samples, with the number of samples in a batch fixed at 128 per domain.

Proxy A-distance. To analyze the performance improvement in depth, we measured the Proxy A-Distance (PAD) as an empirical approximation of domain discrepancy (Ganin et al. (2016)). Given the generalization error $\epsilon$ on discriminating between the target and source samples, PAD is defined as $\hat{d}_A = 2(1 - 2\epsilon)$. Figure 3a shows that MIAN yields a lower PAD between the source and target domains on average, potentially associated with optimizing $\hat{d}_{\mathcal{H}}(\mathcal{V})$. To test this conjecture, we conducted an ablation study on the objective of the domain discriminator (Figures 3b, 3c). All other experimental settings were fixed except for using (13) or (14) as the objective of the unified domain discriminator. While both cases help the adaptation, using (14) yields a lower $\hat{d}_{\mathcal{H}}(\mathcal{V})$ and higher test accuracy.

Estimation of mutual information.
We measure the empirical mutual information $\hat{I}(Z; V)$, assuming $H(V)$ is constant. For the measurement, we trained the domain discriminator to minimize the softmax cross-entropy (13) for a sufficient number of iterations. Figure 3d shows that MIAN yields the lowest $\hat{I}(Z; V)$, indicating that the obtained representation achieves a low level of domain dependence.
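The PAD measurement described above can be sketched as follows; the nearest-centroid probe classifier here is a hypothetical stand-in of ours for the trained source-vs-target classifier used in practice.

```python
import numpy as np

def proxy_a_distance(Z_s, Z_t):
    # d_A = 2 * (1 - 2 * err), where err is the error of a source-vs-target
    # probe classifier (here: hypothetical nearest-centroid rule)
    X = np.vstack([Z_s, Z_t])
    y = np.r_[np.zeros(len(Z_s)), np.ones(len(Z_t))]
    c_s, c_t = Z_s.mean(axis=0), Z_t.mean(axis=0)
    pred = (np.linalg.norm(X - c_t, axis=1)
            < np.linalg.norm(X - c_s, axis=1)).astype(float)
    err = (pred != y).mean()
    return 2.0 * (1.0 - 2.0 * err)
```

Well-separated domains yield an error near 0 and a PAD near 2, while indistinguishable domains yield an error near 0.5 and a PAD near 0, which is the behavior compared across models in Figure 3a.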

6. CONCLUSION

In this paper, we have presented a unified information-regularization framework for MDA. The proposed framework allows us to reexamine existing adversarial DA methods and also motivates us to implement a novel neural architecture for MDA. We provided both theoretical arguments and empirical evidence for three potential pitfalls of using multiple discriminators: dispersed domain-discriminative knowledge, lack of scalability, and high variance in the objective. Our framework also establishes a bridge between adversarial DA and Information Bottleneck theory. The proposed model does not require complicated settings such as image generation, pretraining, or multiple discriminators, encoders, or classifiers, which are often adopted in existing MDA methods (Zhao et al. (2019b;c); Wang et al. (2019)).

A PROOFS

In this Section, we present the detailed proofs of Theorems 2 and 3 from the main paper. We also present Lemma 2, as mentioned in Section 4. Following Roh et al. (2020), we provide a proof of Theorem 2 below for the sake of completeness.

A.1 PROOF OF THEOREM 2

Theorem 2. Let $P_Z(z)$ be the distribution of $Z$ where $z \in \mathcal{Z}$. Let $h$ be a domain classifier $h : \mathcal{Z} \rightarrow \mathcal{V}$, where $\mathcal{Z}$ is the feature space and $\mathcal{V}$ is the set of domain labels. Let $h_v(z)$ be the conditional probability of $V = v$, where $v \in \mathcal{V}$, given $Z = z$, defined by $h$. Then the following holds:

$$I(Z; V) = \max_{h_v(z):\, \sum_{v\in\mathcal{V}} h_v(z)=1,\, \forall z}\; \sum_{v\in\mathcal{V}} P_V(v)\,\mathbb{E}_{z\sim P_{Z|v}}\left[\log h_v(z)\right] + H(V)$$

Proof. By definition,

$$I(Z; V) = D_{KL}\left(P(Z, V)\,\|\,P(Z)P(V)\right) = \sum_{v\in\mathcal{V}} P_V(v)\,\mathbb{E}_{z\sim P_{Z|v}}\left[\log\frac{P_{Z,V}(z, v)}{P_Z(z)}\right] + H(V). \tag{18}$$

Let us constrain the term inside the log by $h_v(z) = \frac{P_{Z,V}(z,v)}{P_Z(z)}$, where $h_v(z)$ represents the conditional probability of $V = v$ for any $v \in \mathcal{V}$ given $Z = z$. Then we have $\sum_{v\in\mathcal{V}} h_v(z) = 1$ for all possible values of $z$, by the law of total probability. Let $h$ denote the collection of $h_v(z)$ for all possible values of $v$ and $z$, and $\lambda$ the collection of $\lambda_z$ for all values of $z$. Then, we can construct the Lagrangian function by incorporating the constraint $\sum_{v\in\mathcal{V}} h_v(z) = 1$ as follows:

$$\mathcal{L}(h, \lambda) = \sum_{v\in\mathcal{V}} P_V(v)\,\mathbb{E}_{z\sim P_{Z|v}}\left[\log h_v(z)\right] + H(V) + \sum_{z\in\mathcal{Z}}\lambda_z\left(1 - \sum_{v\in\mathcal{V}} h_v(z)\right) \tag{19}$$

We can use the following KKT conditions:

$$\frac{\partial\mathcal{L}(h, \lambda)}{\partial h_v(z)} = \frac{P_V(v)\,P_{Z|v}(z)}{h_v^*(z)} - \lambda_z^* = 0, \quad \forall (z, v) \in \mathcal{Z}\times\mathcal{V} \tag{20}$$

$$1 - \sum_{v\in\mathcal{V}} h_v^*(z) = 0, \quad \forall z \in \mathcal{Z}$$

Solving the two equations, we have $1 - \sum_{v\in\mathcal{V}}\frac{P_V(v)P_{Z|v}(z)}{\lambda_z^*} = 0$, such that $\lambda_z^* = P_Z(z)$ for all $z$. Then for all possible values of $z$,

$$h_v^*(z) = \frac{P_{Z,V}(z, v)}{P_Z(z)} = P_{V|z}(v),$$

where the given $h_v^*(z)$ is the same as the term inside the log in (18). Thus, the optimum of the concave Lagrangian function (19), attained at $h_v^*(z)$, is equal to the mutual information in (18). The substitution of $h_v^*(z)$ into (18) completes the proof.
Our framework can further be applied to segmentation problems because it provides a new perspective on pixel-space (Sankaranarayanan et al. (2018a;b); Murez et al. (2018)) and segmentation-space (Tsai et al. (2018)) adaptation. The generator in pixel-space and segmentation-space adaptation learns to transform images or segmentation results from one domain to another. In the context of information regularization, we can view these approaches as limiting the information $I(\tilde{X}; V)$ between the generated output $\tilde{X}$ and the domain label $V$, which is accomplished by involving the encoder for pixel-level generation. This alleviates the domain shift at the raw pixel level. Note that one can choose between limiting the feature-level or the pixel-level mutual information. These different regularization terms may be complementary to each other depending on the given task.

A.2 PROOF OF THEOREM 3

Theorem 3. Let $P_{Z|x,v}(z)$ be the conditional probability distribution of $Z$, where $z \in \mathcal{Z}$, defined by the encoder $F$ given a sample $x \in \mathcal{X}$ and the domain label $v \in \mathcal{V}$. Let $R_Z(z)$ denote a prior marginal distribution of $Z$. Then the following inequality holds:

$$I(Z; X, V) \le \max_{h_v(z):\, \sum_{v\in\mathcal{V}} h_v(z)=1,\, \forall z}\; \sum_{v\in\mathcal{V}} P_V(v)\,\mathbb{E}_{z\sim P_{Z|v}}\left[\log h_v(z)\right] + H(V) + \mathbb{E}_{(x,v)\sim P_{X,V}} D_{KL}\left[P_{Z|x,v}\,\|\,R_Z\right]$$

Proof. Based on the chain rule for mutual information,

$$I(Z; X, V) = I(Z; V) + I(Z; X \mid V) = \max_{h_v(z):\, \sum_{v\in\mathcal{V}} h_v(z)=1,\, \forall z}\; \sum_{v\in\mathcal{V}} P_V(v)\,\mathbb{E}_{z\sim P_{Z|v}}\left[\log h_v(z)\right] + H(V) + I(Z; X \mid V), \tag{24}$$

where the latter equality is given by Theorem 2. Considering $I(Z; X \mid V)$,

$$\begin{aligned} I(Z; X \mid V) &= \mathbb{E}_{v\sim P_V}\,\mathbb{E}_{(z,x)\sim P_{Z,X|v}}\left[\log\frac{P_{Z,X|v}(z, x)}{P_{Z|v}(z)\,P_{X|v}(x)}\right] \\ &= \mathbb{E}_{(x,v)\sim P_{X,V}}\,\mathbb{E}_{z\sim P_{Z|x,v}}\left[\log\frac{P_{Z|x,v}(z)}{P_{Z|v}(z)}\right] \\ &= \mathbb{E}_{(x,v)\sim P_{X,V}}\,\mathbb{E}_{z\sim P_{Z|x,v}}\left[\log P_{Z|x,v}(z)\right] - \mathbb{E}_{v\sim P_V}\,\mathbb{E}_{z\sim P_{Z|v}}\left[\log P_{Z|v}(z)\right] \\ &\le \mathbb{E}_{(x,v)\sim P_{X,V}}\,\mathbb{E}_{z\sim P_{Z|x,v}}\left[\log P_{Z|x,v}(z)\right] - \mathbb{E}_{v\sim P_V}\,\mathbb{E}_{z\sim P_{Z|v}}\left[\log R_Z(z)\right] \\ &= \mathbb{E}_{(x,v)\sim P_{X,V}}\,\mathbb{E}_{z\sim P_{Z|x,v}}\left[\log\frac{P_{Z|x,v}(z)}{R_Z(z)}\right] \\ &= \mathbb{E}_{(x,v)\sim P_{X,V}} D_{KL}\left[P_{Z|x,v}\,\|\,R_Z\right] \end{aligned} \tag{25}$$

The second equality is obtained using $P_{Z,X|v}(z, x) = P_{X|v}(x)\,P_{Z|x,v}(z)$.
The inequality is obtained using $D_{KL}\left[P_{Z|v}\,\|\,R_Z\right] = \mathbb{E}_{z\sim P_{Z|v}}\left[\log P_{Z|v}(z) - \log R_Z(z)\right] \ge 0$, where $R_Z(z)$ is a variational approximation of the prior marginal distribution of $Z$. The last equality follows from the definition of the KL-divergence. The substitution of (25) into (24) completes the proof.

The existing DA work on semantic segmentation tasks (Luo et al. (2019); Song et al. (2019)) can be explained as a process of fostering close collaboration between the aforementioned information bottleneck terms. The only difference between Theorem 3 for $\mathcal{V} = \{0, 1\}$ and the objective function of Luo et al. (2019) is that Luo et al. (2019) employ the shared encoding $P_{Z|x}(z)$ instead of $P_{Z|x,v}(z)$, whereas some adversarial DA approaches use the unshared one (Tzeng et al. (2017)).

A.3 PROOF OF LEMMA 2

Lemma 2. Let $d_{\mathcal{H}}(\mathcal{V}) = \frac{1}{N+1}\sum_{v\in\mathcal{V}} d_{\mathcal{H}}(\mathcal{D}_v, \mathcal{D}_{v^c})$. Let $\mathcal{H}$ be a hypothesis class. Then,

$$d_{\mathcal{H}}(\mathcal{V}) \le \frac{1}{N(N+1)}\sum_{v,u\in\mathcal{V}} d_{\mathcal{H}}(\mathcal{D}_v, \mathcal{D}_u)$$

Proof. Let $\alpha = \frac{1}{N}$ be the uniform domain weight in the mixture $\mathcal{D}_{v^c}$. Then,

$$\begin{aligned} d_{\mathcal{H}}(\mathcal{V}) &= \frac{1}{N+1}\sum_{v\in\mathcal{V}} d_{\mathcal{H}}(\mathcal{D}_v, \mathcal{D}_{v^c}) \\ &= \frac{1}{N+1}\sum_{v\in\mathcal{V}} 2\sup_{h\in\mathcal{H}}\left|\mathbb{E}_{x\sim\mathcal{D}_v^X} I(h(x)=1) - \mathbb{E}_{x\sim\mathcal{D}_{v^c}^X} I(h(x)=1)\right| \\ &= \frac{1}{N+1}\sum_{v\in\mathcal{V}} 2\sup_{h\in\mathcal{H}}\left|\sum_{u\in\mathcal{V}: u\ne v}\alpha\left(\mathbb{E}_{x\sim\mathcal{D}_v^X} I(h(x)=1) - \mathbb{E}_{x\sim\mathcal{D}_u^X} I(h(x)=1)\right)\right| \\ &\le \frac{1}{N+1}\sum_{v\in\mathcal{V}}\sum_{u\in\mathcal{V}: u\ne v}\alpha\cdot 2\sup_{h\in\mathcal{H}}\left|\mathbb{E}_{x\sim\mathcal{D}_v^X} I(h(x)=1) - \mathbb{E}_{x\sim\mathcal{D}_u^X} I(h(x)=1)\right| \\ &= \frac{1}{N(N+1)}\sum_{v,u\in\mathcal{V}} d_{\mathcal{H}}(\mathcal{D}_v, \mathcal{D}_u), \end{aligned}$$

where the inequality follows from the triangle inequality and Jensen's inequality.
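Lemma 2 can be verified numerically for small finite domains by taking $\mathcal{H}$ to be the set of all binary classifiers over a discrete input space; the following is an illustrative check of ours, not part of the paper's experiments.

```python
from itertools import product
import numpy as np

rng = np.random.default_rng(0)
N_PLUS_1, CARD = 4, 5                 # N+1 domains over |X| = 5 points
D = rng.random((N_PLUS_1, CARD))
D /= D.sum(axis=1, keepdims=True)     # each row is a domain distribution

# H = all 2^5 binary classifiers over the discrete input space
H = np.array(list(product([0, 1], repeat=CARD)), dtype=float)

def d_H(p, q):
    # 2 * sup_h |E_p I[h=1] - E_q I[h=1]| over the finite hypothesis class H
    return 2.0 * np.abs(H @ (p - q)).max()

mixture = (D.sum(axis=0) - D) / (N_PLUS_1 - 1)  # D_{v^c}: uniform mix of the others
lhs = np.mean([d_H(D[v], mixture[v]) for v in range(N_PLUS_1)])
rhs = sum(d_H(D[v], D[u]) for v in range(N_PLUS_1)
          for u in range(N_PLUS_1) if u != v) / ((N_PLUS_1 - 1) * N_PLUS_1)
```

Here `lhs` is the average divergence to each mixture and `rhs` the average pairwise divergence, so the lemma predicts `lhs <= rhs` for any choice of domain distributions.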

B EXPERIMENTAL SETUP

In this section, we describe the datasets, network architectures, and hyperparameter configurations.

B.1 DATASETS

We validate the Multi-source Information-regularized Adaptation Networks (MIAN) on the following benchmark datasets: Digits-Five, Office-31, and Office-Home. Every experiment is repeated four times, and the average accuracy on the target domain is reported. Digits-Five (Peng et al. (2019)) is a unified dataset comprising five different digit datasets: MNIST (LeCun et al. (1998)), MNIST-M (Ganin & Lempitsky (2014)), Synthetic Digits (Ganin & Lempitsky (2014)), SVHN, and USPS. Following the standard protocols of unsupervised MDA (Xu et al. (2018); Peng et al. (2019)), we used 25,000 training images and 9,000 test images sampled from the training and testing subsets of each of MNIST, MNIST-M, SVHN, and Synthetic Digits. For USPS, all the data are used owing to the small sample size. All the images are bilinearly interpolated to 32 × 32. Office-31 (Saenko et al. (2010)) is a popular benchmark dataset containing 4,652 images in total of 31 categories of objects in an office environment, drawn from three domains: Amazon, DSLR, and Webcam; it poses a more difficult problem than Digits-Five. All the images are interpolated to 224 × 224 using bicubic filters. Office-Home (Venkateswara et al. (2017)) is a challenging dataset that includes 15,500 images in total of 65 categories of objects in office and home environments, drawn from four domains: Artistic images (Art), Clip Art (Clipart), Product images (Product), and Real-World images (Realworld). All the images are interpolated to 224 × 224 using bicubic filters.
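The 32 × 32 digit preprocessing amounts to a plain bilinear resize. A self-contained numpy sketch of that operation, for illustration only (a real pipeline would use a library routine such as PIL or torchvision):

```python
import numpy as np

def resize_bilinear(img, out_h, out_w):
    # Minimal bilinear resize of a 2-D grayscale image, sampling the input
    # grid at evenly spaced coordinates and blending the 4 nearest pixels.
    in_h, in_w = img.shape
    ys = np.linspace(0, in_h - 1, out_h)
    xs = np.linspace(0, in_w - 1, out_w)
    y0, x0 = np.floor(ys).astype(int), np.floor(xs).astype(int)
    y1, x1 = np.minimum(y0 + 1, in_h - 1), np.minimum(x0 + 1, in_w - 1)
    wy, wx = (ys - y0)[:, None], (xs - x0)[None, :]
    top = img[y0][:, x0] * (1 - wx) + img[y0][:, x1] * wx
    bot = img[y1][:, x0] * (1 - wx) + img[y1][:, x1] * wx
    return top * (1 - wy) + bot * wy

digit = np.random.default_rng(0).random((28, 28))  # MNIST-sized input
resized = resize_bilinear(digit, 32, 32)
assert resized.shape == (32, 32)
```

Because the sampling grid includes the input's corner pixels, the corner values are preserved exactly after resizing.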

B.2 ARCHITECTURES

For the Digits-Five dataset, we use the same network architecture and optimizer settings as in (Peng et al. (2019)). For all the other experiments, the results are based on ResNet-50 pre-trained on ImageNet. The domain discriminator is implemented as a three-layer neural network. The detailed architecture is shown in Figure 5.

There is little motivation for models to control the complex mutual dependence on domains if reducing the entropy of representations is sufficient to optimize the value of I(Z; V) = H(Z) − H(Z | V). If so, such implicit entropy minimization substantially reduces the upper bound of I(Z; Y), leading to an increase in the optimal joint risk λ*. In other words, a decrease in the entropy of representations may occur as a side effect of I(Z; V) regularization. This unexpected side effect of information regularization is highly intertwined with the hidden deterioration of discriminability through adversarial training (Chen et al. (2019); Liu et al. (2019)). Based on these insights, we employ the SVD-entropy H_SVD(Z) (Alter et al. (2000)) of a representation matrix Z to assess the richness of the latent representations during adaptation, since H(Z) itself is difficult to compute. Note that while H_SVD(Z) is not precisely equivalent to H(Z), it can be used as a proxy for the level of disorder of the given matrix (Newton & DeSalvo (2010)). In future work, it would be interesting to evaluate the temporal change in entropy with other metrics. We found that H_SVD(Z) indeed decreases significantly during adversarial adaptation, suggesting that some eigenfeatures (or eigensamples) become redundant and, thus, the inherent feature-richness diminishes (Figure 7a). To preclude such deterioration, we employ Batch Spectral Penalization (BSP) (Chen et al. (2019)), which imposes a constraint on the largest singular value to solicit the contribution of the other eigenfeatures. The overall objective function in the multi-domain setting is defined as:
$$\min_{F,C} \; L(F, C) + \beta\, \hat{I}(Z; V) + \gamma \sum_{i=1}^{N+1} \sum_{j=1}^{k} s_{i,j}^2,$$
where β and γ are Lagrangian multipliers and $s_{i,j}$ is the $j$-th singular value from the $i$-th domain.
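The last term of the objective can be computed from a per-domain SVD of the mini-batch feature matrices. A minimal numpy sketch (shapes, names, and the toy batch are illustrative, not the paper's implementation):

```python
import numpy as np

def bsp_penalty(features_per_domain, k=1):
    # Batch Spectral Penalization term: sum over the N+1 domains of the
    # squared top-k singular values of each domain's (batch x dim) feature
    # matrix. Penalizing the dominant singular values encourages the
    # remaining eigenfeatures to contribute.
    total = 0.0
    for Z in features_per_domain:
        s = np.linalg.svd(Z, compute_uv=False)  # singular values, descending
        total += float(np.sum(s[:k] ** 2))
    return total

rng = np.random.default_rng(0)
feats = [rng.standard_normal((16, 8)) for _ in range(3)]  # N + 1 = 3 domains
assert bsp_penalty(feats, k=1) > 0.0
```

For a rank-1 feature matrix, the k = 1 penalty equals the squared Frobenius norm of the matrix, since all of its spectral energy sits in the top singular value.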
We found that the SVD-entropy of representations deteriorates severely, especially in the early stages of training, suggesting the possibility of over-regularization. The noisy domain-discriminative signals in the initial phase (Ganin et al. (2016)) may distort and simplify the representations. To circumvent impaired discriminability in the early stages of training, discriminability should be prioritized first with high γ and low β, followed by gradual decaying and annealing of γ and β, respectively, so that a sufficient level of domain transferability is guaranteed. Based on our temporal analysis, we introduce training-dependent scaling of β and γ by modifying the progressive training schedule (Ganin et al. (2016)):
$$\beta_p = \beta_0 \left( \frac{2}{1 + \exp(-\sigma \cdot p)} - 1 \right), \qquad \gamma_p = \gamma_0 \cdot 2\left( 1 - \frac{1}{1 + \exp(-\sigma \cdot p)} \right),$$
where β_0 and γ_0 are the initial values, σ is a decaying parameter, and p is the training progress increasing from 0 to 1. We refer to this version of our model as MIAN-γ. Note that MIAN only includes annealing-β, excluding DBSP. For the proposed method, β_0 is chosen from {0.1, 0.2, 0.3, 0.4, 0.5} for the Office-31 and Office-Home datasets, while β_0 = 1.0 is fixed for Digits-Five. γ_0 is fixed to 1e−4 following Chen et al. (2019).

Table 5: Hyperparameter configuration. Annealing-β is not adopted in the Digits-Five experiment. Decaying batch spectral penalization is not adopted in MIAN.

SVD-entropy. We evaluated the degree to which SVD-entropy is compromised by transfer learning. For this, DSLR was fixed as the source domain, and the Webcam and Amazon target domains were used to simulate low (DSLR→Webcam; DW) and high (DSLR→Amazon; DA) domain-shift conditions, respectively. SVD-entropy was applied to the representation matrices extracted from ResNet-50 and MIAN (denoted as Adapt in Figure 7a) with constant β = 0.1. For an accurate assessment, we avoided using spectral penalization.
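Assuming the directions stated in the text (β anneals upward from 0 toward β_0 while γ decays from γ_0 toward 0 as the progress p goes from 0 to 1), the two schedules can be sketched as follows; σ = 10 is an illustrative choice, not a value taken from the paper:

```python
import math

def beta_p(p, beta0, sigma=10.0):
    # Annealing-beta: ~0 at p = 0, approaches beta0 as p -> 1.
    return beta0 * (2.0 / (1.0 + math.exp(-sigma * p)) - 1.0)

def gamma_p(p, gamma0, sigma=10.0):
    # Decaying-gamma: gamma0 at p = 0, approaches 0 as p -> 1.
    return gamma0 * 2.0 * (1.0 - 1.0 / (1.0 + math.exp(-sigma * p)))

# Early training: low beta (weak information regularization), full gamma
# (strong spectral penalization); late training: the roles reverse.
for p in (0.0, 0.5, 1.0):
    print(p, beta_p(p, 0.2), gamma_p(p, 1e-4))
```

Both curves are monotone in p, so discriminability is protected early on and transferability dominates later, matching the prioritization argued above.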
As depicted in Figure 7a, adversarial adaptation, i.e., information regularization, significantly decreases the SVD-entropy of both the source- and target-domain representations, especially in the early stages of training, indicating that the representations are simplified in terms of feature-richness. Moreover, comparing the Adapt_DA_source and Adapt_DW_source conditions, we found that SVD-entropy decreases more as the degree of domain shift increases. We additionally analyzed the temporal changes in SVD-entropy by comparing BSP and decaying BSP (Figure 7b). Under DBSP, SVD-entropy gradually decreases as the degree of compensation decreases, which leads to improved transferability and accuracy. Thus, DBSP can control the trade-off between the richness of the feature representations and adversarial adaptation as training proceeds.

Ablation study. We performed an ablation study to assess the contributions of decaying spectral penalization and annealing information regularization to DA performance (Tables 6 and 7). We found that prioritizing feature-richness in the early stages (by controlling β and γ) significantly improves performance. We also found that the constant penalization schedule (Chen et al. (2019)) is not reliable and sometimes impedes transferability in the low domain-shift condition (Webcam, DSLR in Table 6). This implies that conventional BSP may over-regularize transferability when the degree of domain shift and the decline in SVD-entropy are relatively small.
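The SVD-entropy used in this analysis is the Shannon entropy of the normalized squared singular values of the representation matrix (Alter et al. (2000)). A minimal numpy sketch with synthetic matrices standing in for real representations:

```python
import numpy as np

def svd_entropy(Z, eps=1e-12):
    # SVD-entropy of a representation matrix: entropy of the normalized
    # squared singular values. Low values indicate that a few eigenfeatures
    # dominate, i.e. reduced feature-richness.
    s = np.linalg.svd(Z, compute_uv=False)
    rho = s ** 2 / np.sum(s ** 2)
    return float(-np.sum(rho * np.log(rho + eps)))

rng = np.random.default_rng(0)
rich = rng.standard_normal((64, 32))  # near-isotropic spectrum: high entropy
collapsed = np.outer(rng.standard_normal(64),
                     rng.standard_normal(32))  # rank-1: entropy near 0
assert svd_entropy(rich) > svd_entropy(collapsed)
```

A drop in this quantity during adaptation is exactly the "simplification" effect described above: spectral energy concentrating in a few singular directions.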



We introduce the notations for the MDA problem in classification. A set of source domains and the target domain are denoted by $\{\mathcal{D}_{S_i}\}_{i=1}^N$ and $\mathcal{D}_T$, respectively. Let $X_{S_i} = \{x_j^{S_i}\}_{j=1}^m \sim (\mathcal{D}_{S_i}^X)^m$ be the set of $m$ i.i.d. samples from $\mathcal{D}_{S_i}$, and let $X_T = \{x_j^T\}_{j=1}^m \sim (\mathcal{D}_T^X)^m$ be the set of $m$ i.i.d. samples generated from the marginal distribution $\mathcal{D}_T^X$.

Figure 1: Proposed neural architecture for multi-source domain adaptation: Multi-source Information-regularized Adaptation Network (MIAN). Multi-source and target-domain input data are fed into the encoder. We denote arbitrary source domains as S i and S j without loss of generality. The domain discriminator outputs a logit vector, where each dimension corresponds to one domain. CNN and FCN refer to convolutional neural networks and fully connected neural networks, respectively.

Figure 2: (a)∼(b): Test accuracies for (a) MNIST-M and (b) SVHN as target domains. (c)∼(d): Variance of stochastic gradients after 1000 steps for (c) MNIST-M and (d) SVHN as target domains, in log scale. Lower is better.

Figure 3: (a) Proxy A-distance. (b)∼(c) Ablation study on the objective of the domain discriminator. CEN stands for the multi-class cross-entropy loss in (13), while BCE stands for the binary cross-entropy losses in (14). (d) Empirical information Î(Z; V). We treat H(V) = log |V|.

After the early stages of training, we computed the gradients for the weights and biases of both the top and bottom layers of the encoder on the full training set. Figures 2c and 2d show that MIAN with the unified discriminator yields an exponentially lower variance of the gradients compared to Multi D. It is thus more feasible to use the unified discriminator when a large number of domains are given.

Figure 4: Comparison of existing and proposed MDA models. (a) Existing multiple-discriminator-based methods align each pairwise source and target domain but may fail by neglecting the domain shift between source domains; they may also suffer from unstable optimization and a lack of resource-efficiency. (b) Our proposed model mitigates these problems by unifying the domain discriminators.

Figure 5: Network architectures. BN denotes Batch Normalization (Ioffe & Szegedy (2015)), and SVD denotes the differentiable SVD in PyTorch used for MIAN-γ (Section E). Abbreviations: Batch Spectral Penalization (BSP, Chen et al. (2019)), Adversarial Discriminative Domain Adaptation (ADDA, Tzeng et al. (2017)), Maximum Classifier Discrepancy (MCD, Saito et al. (2018)), Deep Cocktail Network (DCTN, Xu et al. (2018)), and Moment Matching for Multi-Source Domain Adaptation (M3SDA, Peng et al. (2019)).

Figure 7: (a): SVD-entropy analysis. (Office-31; Source domain: DSLR) (b): Comparisons between BSP and DBSP. (Office-31; DSLR → Amazon)

Accuracy (%) on the Digits-Five dataset. SYNTH denotes Synthetic Digits (Ganin & Lempitsky (2014)). The baseline results for the Digits-Five dataset were taken from (Peng et al. (2019)).

2) single-best, i.e., the best single-source adaptation performance on the target domain is reported. Owing to limited space, details of the simulation settings, baseline models, and datasets are presented in the Supplementary Material.

Accuracy (%) on the Office-31 dataset.

Accuracy (%) on Office-Home dataset.

Experimental setup. The batch size for each domain is reported.

Hyperparameters. Details of the experimental setup are summarized in Table 4. The other state-of-the-art adaptation models are trained with the same setup, except in the following cases: DCTN shows poor performance with the learning rate shown in Table 4 for both the Office-31 and Office-Home datasets; following the suggestion of the original authors, 1e−5 is used as the learning rate with the Adam optimizer (Kingma & Ba (2014)). MCD shows poor performance on the Office-Home dataset with the learning rate shown in Table 4; 1e−4 is selected as the learning rate instead. For both the proposed and baseline models, the learning rate of the classifier or domain discriminator trained from scratch is set to 10 times that of the ImageNet-pretrained weights on the Office-31 and Office-Home datasets. More hyperparameter configurations are summarized in Table 5 (Section E).

Ablation study of decaying batch spectral penalization and annealing information regularization (Office-31). For an accurate assessment of the extent to which the performance improvement is caused by each strategy, γ is fixed at 0 in Annealing-β, and β is fixed at 0.1 in Decaying-γ. Results for Annealing-β are reported in the main paper.


Algorithm 1: Multi-source Information-regularized Adaptation Networks (MIAN). Input: mini-batch size for each domain m, number of source domains N, training iterations T. Set M = m(N + 1) and the set of domain labels V = {1, . . . , N + 1}. At each iteration, backpropagate the gradient of L(h), or its variant (Mao et al. (2017)), to h.

D ADDITIONAL RESULTS

Visualization of learned latent representations. We visualized the domain-independent representations extracted by the input layer of the classifier with t-SNE (Figure 6). Before the adaptation process, the representations from the target domain were isolated from those of each source domain. After adaptation, however, the representations were well aligned with respect to the class of digits, as opposed to the domain.

Hyperparameter sensitivity. We analyzed the sensitivity to the degree of regularization β. The target domain is set to Amazon or Art, and the value of β_0 is varied from 0.1 to 0.5. The accuracy is high when β_0 is approximately between 0.1 and 0.3. We thus choose β_0 = 0.2 for Office-31 and β_0 = 0.3 for Office-Home.

E DECAYING BATCH SPECTRAL PENALIZATION

In this section, we provide details on the Decaying Batch Spectral Penalization (DBSP), which expands MIAN into MIAN-γ.

