EMTL: A GENERATIVE DOMAIN ADAPTATION APPROACH

Abstract

We propose an unsupervised domain adaptation approach based on generative models. We show that when the source probability density function can be learned, a one-step Expectation-Maximization iteration plus an additional marginal density function constraint will produce a proper mediator probability density function to bridge the gap between the source and target domains. The breakthrough is based on modern generative models (autoregressive mixture density nets) that are competitive with discriminative models on moderate-dimensional classification problems. By decoupling the source density estimation from the adaptation steps, we can design a domain adaptation approach where the source data is locked away after being processed only once, opening the door to transfer when data security or privacy concerns impede the use of traditional domain adaptation. We demonstrate that our approach can achieve state-of-the-art performance on synthetic and real data sets, without accessing the source data at the adaptation phase.

1. INTRODUCTION

In the classical supervised learning paradigm, we assume that the training and test data come from the same distribution. In practice, this assumption often does not hold. When the pipeline includes massive data labeling, models are routinely retrained after each data collection campaign. However, data labeling costs often make retraining impractical. Without labeled data, it is still possible to train the model on a training set which is relevant to, but not identically distributed with, the test set. Due to the distribution shift between the training and test sets, performance usually cannot be guaranteed. Domain adaptation (DA) is a machine learning subdomain that aims at learning a model from biased training data. It explores the relationship between the source (labeled training data) and target (test data) domains to find a mapping function and fix the bias, so that the model learned on the source data can be applied in the target domain. Usually some target data is needed during the training phase to calibrate the model. In unsupervised domain adaptation (UDA), only unlabeled target data is needed during the training phase. UDA is an appealing learning paradigm since obtaining unlabeled data is usually easy in many applications. UDA allows the model to be deployed in various target domains with different shifts using a single labeled source data set. Due to these appealing operational features, UDA has become a prominent research field with various approaches. Kouw & Loog (2019) and Zhuang et al. (2020) surveyed the latest progress on UDA and found that most approaches are based on discriminative models, either reweighting the source instances to approximate the target distribution or learning a feature mapping function to reduce the statistical distance between the source and target domains. After calibration, a discriminative model is trained on the adjusted source data and used in the target domain.
In this workflow, the adaptation algorithm usually has to access the source and target data simultaneously. However, accessing the source data during the adaptation phase is not possible when the source data is sensitive (for example, because of security or privacy issues). In particular, in our application workflow an industrial company sells devices to various service companies which cannot share their customer data with each other. The industrial company may contract with one of the service companies to access its data during an R&D phase, but this data will not be available when the industrial company sells the device (and the predictive model) to other service companies. In this paper we propose EMTL, a generative UDA algorithm for binary classification that does not have to access the source data during the adaptation phase. We use density estimation to estimate the joint source probability function p_s(x, y) and the marginal target probability function p_t(x), and use them for domain adaptation. To solve the data security issue, EMTL decouples source density estimation from the adaptation steps. In this way, after the source preprocessing we can put away or delete the source data. Our approach is motivated by the theory on domain adaptation (Ben-David et al., 2010), which claims that the error of a hypothesis h on the target domain can be bounded by three terms: the error on the source domain, the distance between the source and target distributions, and the expected difference in labeling functions. This theorem motivated us to define a mediator density function p_m(x, y) i) whose conditional probability of y given x is equal to that of the source and ii) whose marginal density on x is equal to that of the target. We can then construct a Bayes-optimal classifier on the target domain under the assumption of covariate shift (the distribution of y|x is the same in the source and target domains).
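The construction above rests on a standard generative-classification identity: given class-conditional densities p_s(x|y) and class priors, the Bayes rule p(y|x) ∝ p_s(x|y) p(y) carries over to the target under covariate shift. The following is a minimal illustrative sketch of this idea (not the paper's autoregressive density nets; it substitutes kernel density estimation for simplicity, and the class `GenerativeBayesClassifier` is a hypothetical name):

```python
import numpy as np
from sklearn.neighbors import KernelDensity

class GenerativeBayesClassifier:
    """Bayes classifier built from per-class density estimates p(x|y).

    Under covariate shift, p(y|x) is shared between domains, so a
    classifier built from source class-conditional densities remains
    Bayes-optimal on the target.
    """

    def __init__(self, bandwidth=0.5):
        self.bandwidth = bandwidth

    def fit(self, X, y):
        self.classes_ = np.unique(y)
        self.kdes_, self.log_priors_ = [], []
        for c in self.classes_:
            Xc = X[y == c]
            # estimate log p(x | y = c) with a KDE (stand-in for a density net)
            self.kdes_.append(KernelDensity(bandwidth=self.bandwidth).fit(Xc))
            self.log_priors_.append(np.log(len(Xc) / len(X)))
        return self

    def predict(self, X):
        # score each class by log p(y=c) + log p(x|y=c); argmax is the Bayes rule
        scores = np.stack(
            [lp + kde.score_samples(X)
             for kde, lp in zip(self.kdes_, self.log_priors_)],
            axis=1,
        )
        return self.classes_[np.argmax(scores, axis=1)]
```

Any density estimator with a tractable log-likelihood can play the role of the KDE here, which is what makes modern neural density estimators directly usable in this construction.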
Our approach became practical with the recent advances in (autoregressive) neural density estimation (Uria et al., 2013). We learn p_m(x, y) from p_s(x, y) and p_t(x) to bridge the gap between the source and target domains. We regard the label on the target data as a latent variable and show that if p_s(x|y = i) can be learned perfectly for i ∈ {0, 1}, then a one-step Expectation-Maximization iteration (hence the name EMTL) will produce a density function p_m(x, y) with the following properties on the target data: i) it minimizes the Kullback-Leibler divergence between p_m(y_i|x_i) and p_s(y_i|x_i); ii) it maximizes the log-likelihood log p_m(x_i). Then, by adding an additional marginal constraint on p_m(x_i) to make it explicitly close to p_t(x_i) on the target data, we obtain the final objective function for EMTL. Although this analysis assumes simple covariate shift, we will experimentally show that EMTL can go beyond this assumption and work well under other distribution shifts. We conduct experiments on synthetic and real data to demonstrate the effectiveness of EMTL. First, we construct a simple two-dimensional data set to visualize the performance of EMTL. Second, we use UCI benchmark data sets and the Amazon reviews data set to show that EMTL is competitive with state-of-the-art UDA algorithms, without accessing the source data at the adaptation phase. To the best of our knowledge, EMTL is the first work using density estimation for unsupervised domain adaptation. Unlike other existing generative approaches (Kingma et al., 2014; Karbalayghareh et al., 2018; Sankaranarayanan et al., 2018), EMTL can decouple the source density estimation process from the adaptation phase and thus can be used in situations where the source data is not available at the adaptation phase due to security or privacy reasons.

2. RELATED WORK

Zhuang et al. (2020), Kouw & Loog (2019), and Pan & Yang (2009) categorize DA approaches into instance-based and feature-based techniques. Instance-based approaches reweight labeled source samples according to the ratio between the source and target densities. Importance weighting methods reweight source samples to reduce the divergence between the source and target densities (Huang et al., 2007; Gretton et al., 2007; Sugiyama et al., 2007). In contrast, class importance weighting methods reweight source samples to make the source and target label distributions the same (Azizzadenesheli et al., 2019; Lipton et al., 2018; Zhang et al., 2013). Feature-based approaches learn a new representation for the source and the target by minimizing the divergence between the source and target distributions. Subspace mapping methods assume that there is a common subspace between the source and target (Fernando et al., 2013; Gong et al., 2012). Courty et al. (2017) proposed to use optimal transport to constrain the learning of the transformation function.

Other methods aim at learning a representation which is domain-invariant across domains (Gong et al., 2016; Pan et al., 2010). Besides these shallow models, deep learning has also been widely applied to domain adaptation (Tzeng et al., 2017; Ganin et al., 2016; Long et al., 2015). DANN (Ganin et al., 2016) learns a representation using a neural network which is discriminative for the source task while being unable to distinguish the source and target domains from each other. Kingma et al. (2014) and Belhaj et al. (2018) proposed variational-inference-based semi-supervised learning approaches that regard the missing label as a latent variable and perform posterior inference.
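The instance-based importance-weighting scheme discussed in this section can be sketched concretely: source samples are weighted by the density ratio w(x) = p_t(x)/p_s(x) before a discriminative model is fit. The sketch below is illustrative only (it is not any cited method's implementation; the function name `importance_weighted_fit`, the KDE density estimates, and the ratio clipping are our assumptions for the example):

```python
import numpy as np
from sklearn.neighbors import KernelDensity
from sklearn.linear_model import LogisticRegression

def importance_weighted_fit(Xs, ys, Xt, bandwidth=0.5, clip=10.0):
    """Fit a classifier on source data (Xs, ys) reweighted toward
    the unlabeled target marginal Xt via the ratio p_t(x) / p_s(x)."""
    kde_s = KernelDensity(bandwidth=bandwidth).fit(Xs)
    kde_t = KernelDensity(bandwidth=bandwidth).fit(Xt)
    # density ratio computed in log space for numerical stability
    log_w = kde_t.score_samples(Xs) - kde_s.score_samples(Xs)
    # clip extreme ratios, which otherwise dominate the weighted loss
    w = np.clip(np.exp(log_w), 0.0, clip)
    return LogisticRegression().fit(Xs, ys, sample_weight=w)
```

Note that unlike EMTL, this workflow needs the source samples themselves at adaptation time, since the weights are attached to individual source instances; this is exactly the access pattern that EMTL's decoupling avoids.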

