A SIMPLE UNIFIED INFORMATION REGULARIZATION FRAMEWORK FOR MULTI-SOURCE DOMAIN ADAPTATION

Abstract

Adversarial learning strategies have demonstrated remarkable performance in dealing with single-source unsupervised Domain Adaptation (DA) problems, and they have recently been applied to multi-source DA problems. Although most existing DA methods use multiple domain discriminators, the effect of using multiple discriminators on the quality of latent space representations has been poorly understood. Here we provide theoretical insights into potential pitfalls of using multiple domain discriminators: First, domain-discriminative information is inevitably distributed across multiple discriminators. Second, it is not scalable in terms of computational resources. Third, the variance of stochastic gradients from multiple discriminators may increase, which significantly undermines training stability. To fully address these issues, we situate adversarial DA in the context of information regularization. First, we present a unified information regularization framework for multi-source DA. It provides a theoretical justification for using a single, unified domain discriminator to encourage the synergistic integration of the information gleaned from each domain. Second, this motivates us to implement a novel neural architecture called Multi-source Information-regularized Adaptation Networks (MIAN). The proposed model significantly reduces the variance of stochastic gradients and increases computational efficiency. Large-scale simulations on various multi-source DA scenarios demonstrate that MIAN, despite its structural simplicity, reliably outperforms other state-of-the-art methods by a large margin, especially for difficult target domains.

1. INTRODUCTION

Although a large number of studies have demonstrated the ability of deep neural networks to solve challenging tasks, the tasks they solve are mostly confined to a similar type or a single domain. One remaining challenge is the problem known as domain shift (Gretton et al. (2009)), where a direct transfer of information gleaned from a single source domain to unseen target domains may lead to significant performance impairment. Domain adaptation (DA) approaches aim to mitigate this problem by learning to map data from both domains onto a common feature space. Whereas several theoretical results (Ben-David et al. (2007); Blitzer et al. (2008); Zhao et al. (2019a)) and algorithms for DA (Long et al. (2015; 2017); Ganin et al. (2016)) have focused on the case in which only a single source domain dataset is given, we consider a more challenging and generalized problem of knowledge transfer, referred to as multi-source unsupervised DA (MDA). Following a seminal theoretical result on MDA (Blitzer et al. (2008); Ben-David et al. (2010)), technical advances have been made, mainly on adversarial methods (Xu et al. (2018); Zhao et al. (2019c)).

While most adversarial MDA methods use multiple independent domain discriminators (Xu et al. (2018); Zhao et al. (2018); Li et al. (2018); Zhao et al. (2019c;b)), the potential pitfalls of this setting have not been fully explored. The existing works do not provide a theoretical guarantee that unnecessary domain-specific information is fully filtered out, because the domain-discriminative information is inevitably distributed across multiple discriminators. For example, the multiple domain discriminators focus only on estimating the domain shift between each source domain and the target, while the discrepancies between the source domains are neglected, making it hard to align all the given domains. This necessitates garnering the domain-discriminative information with a unified discriminator. Moreover, the multiple-discriminator setting is not scalable in terms of computational resources, especially when a large number of source domains is given, e.g., medical reports from multiple patients. Finally, it may undermine the stability of training, as earlier works solve multiple independent adversarial minimax problems.

To overcome such limitations, we propose a novel MDA method, called Multi-source Information-regularized Adaptation Networks (MIAN), that constrains the mutual information between latent representations and domain labels. First, we show that such mutual information regularization is closely related to the explicit optimization of the H-divergence between the source and target domains. This affords the theoretical insight that conventional adversarial DA can be translated into an information-theoretic regularization problem. Second, based on our findings, we propose a new optimization problem for MDA: minimizing the adversarial loss over multiple domains with a single domain discriminator. We show that the domain shift between each pair of source domains can be indirectly penalized with a single domain discriminator, which is known to be beneficial in MDA (Li et al. (2018); Peng et al. (2019)). Moreover, by analyzing existing studies in terms of information regularization, we found that the variance of the stochastic gradients increases when using multiple discriminators. Despite its structural simplicity, we found that MIAN works efficiently across a wide variety of MDA scenarios, including the Digits-Five (Peng et al. (2019)), Office-31 (Saenko et al. (2010)), and Office-Home (Venkateswara et al. (2017)) datasets. Intriguingly, MIAN reliably and significantly outperformed several state-of-the-art methods that either employ a separate domain discriminator for each source domain (Xu et al. (2018)) or align the moments of deep feature distributions for every pairwise domain (Peng et al. (2019)).

2. RELATED WORKS

Several DA methods have been used in an attempt to learn domain-invariant representations. Along with the increasing use of deep neural networks, contemporary work focuses on matching deep latent representations from the source domain with those from the target domain. Several measures have been introduced to handle domain shift, such as maximum mean discrepancy (MMD) (Long et al. (2014; 2015)), correlation distance (Sun et al. (2016); Sun & Saenko (2016)), and Wasserstein distance (Courty et al. (2017)). Recently, adversarial DA methods (Ganin et al. (2016); Tzeng et al. (2017); Hoffman et al. (2017); Saito et al. (2018; 2017)) have become mainstream approaches owing to the development of generative adversarial networks (Goodfellow et al. (2014)). However, the abovementioned single-source DA approaches inevitably sacrifice performance when applied to multi-source DA.

Some MDA studies (Blitzer et al. (2008); Ben-David et al. (2010); Mansour et al. (2009); Hoffman et al. (2018)) have provided the theoretical background for algorithm-level solutions. Blitzer et al. (2008) and Ben-David et al. (2010) explore the extended upper bound of the true risk on unlabeled samples from the target domain with respect to a weighted combination of multiple source domains. Following these theoretical studies, MDA methods with shallow models (Duan et al. (2012b;a); Chattopadhyay et al. (2012)) as well as with deep neural networks (Mancini et al. (2018); Peng et al. (2019); Li et al. (2018)) have been proposed. Recently, several adversarial MDA methods have also been proposed. Xu et al. (2018) implemented a k-way domain discriminator and classifier to battle both domain and category shifts. Zhao et al. (2018) also used multiple discriminators to optimize the average-case generalization bounds. Zhao et al. (2019c) chose relevant source training samples for DA by minimizing the empirical Wasserstein distance between the source and target domains. Instead of using separate encoders, domain discriminators, or classifiers for each source domain as in earlier works, our approach uses unified networks, thereby improving resource efficiency and scalability.

Several existing MDA works have proposed methods to estimate the source domain weights following Blitzer et al. (2008) and Ben-David et al. (2010). Mansour et al. (2009) assumed that the target hypothesis can be approximated by a convex combination of the source hypotheses. Peng et al. (2019) and Zhao et al. (2018) suggested ad-hoc schemes for domain weights based on the empirical risk of each source domain. Li et al. (2018) computed a softmax-transformed weight vector using an empirical Wasserstein-like measure instead of the empirical risks. Compared with these methods, which lack robust theoretical justification, our analysis does not require any assumption about, or estimation of, the domain coefficients. In our framework, the representations are distilled to be independent of the domain, thereby rendering the performance relatively insensitive to explicit weighting strategies.
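The unified-discriminator idea discussed above can be illustrated with a minimal sketch: a single (N+1)-way domain classifier is trained to predict which of the N source domains (or the target) a latent representation came from, and the encoder is trained against it. This is only a schematic NumPy illustration under simplifying assumptions, not the paper's actual implementation: the linear discriminator (`W`, `b`), the function name `domain_losses`, and the gradient-reversal-style sign flip are all illustrative placeholders.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over the last axis.
    e = np.exp(x - x.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def domain_losses(Z, d, W, b):
    """Losses for a single unified (N+1)-way domain discriminator.

    Z : (batch, feat) latent representations pooled from all N source
        domains and the target.
    d : (batch,) integer domain labels in {0, ..., N}.
    W, b : parameters of a linear softmax discriminator (a hypothetical
        stand-in for a discriminator network).

    Returns (disc_loss, adv_loss). The discriminator minimizes disc_loss
    (cross-entropy on domain labels); the encoder minimizes adv_loss,
    the negated cross-entropy, pushing representations toward being
    domain-indistinguishable -- a proxy for reducing the mutual
    information between representations and domain labels.
    """
    p = softmax(Z @ W + b)                          # (batch, N+1)
    nll = -np.log(p[np.arange(len(d)), d] + 1e-12)  # per-sample NLL
    disc_loss = nll.mean()
    adv_loss = -disc_loss                           # gradient-reversal view
    return disc_loss, adv_loss
```

Because every domain (all sources and the target) shares one discriminator, the cross-entropy simultaneously reflects source-target shifts and source-source shifts, which is the intuition behind penalizing pairwise domain discrepancies without instantiating one minimax problem per source domain.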

