DISENTANGLED CYCLIC RECONSTRUCTION FOR DOMAIN ADAPTATION

Abstract

The domain adaptation problem involves learning a unique classification or regression model capable of performing on both a source and a target domain. Although the labels for the source data are available during training, the labels in the target domain are unknown. An effective way to tackle this problem lies in extracting insightful features invariant across the source and target domains. In this work, we propose splitting the information for each domain into a task-related representation and its complementary context representation. We propose an original method to disentangle these two representations in the single-domain supervised case. We then adapt this method to the unsupervised domain adaptation problem. In particular, our method allows disentanglement in the target domain, despite the absence of training labels. This enables the isolation of task-specific information from both domains and its projection into a common representation. The task-specific representation allows efficient transfer of the knowledge acquired from the source domain to the target domain. We validate the proposed method on several classical domain adaptation benchmarks and illustrate the benefits of disentanglement for domain adaptation.

1. INTRODUCTION

The wide adoption of Deep Neural Networks in practical supervised learning applications is hindered by their sensitivity to the training data distribution. This problem, known as domain shift, can drastically degrade, in real-life operating conditions, the performance of a model that seemed perfectly efficient in simulation. Learning a model with the goal of making it robust to a specific domain shift is called domain adaptation (DA). Often, the data available to achieve DA consist of a labeled training set from a source domain and an unlabeled sample set from a target domain. This yields the problem of unsupervised domain adaptation (UDA).

In this work, we take an information disentanglement perspective on UDA. We argue that a key to efficient UDA lies in separating the information necessary to complete the network's task (classification or regression) from a task-orthogonal information which we call context or style. Disentanglement in the target domain seems, however, a difficult endeavor since the available data is unlabeled. Our contribution is twofold. First, we propose a formal definition of the disentanglement problem for UDA which, to the best of our knowledge, is new. Then, we design a new learning method, called DiCyR (Disentangled Cyclic Reconstruction), which relies on cyclic reconstruction of inputs in order to achieve efficient disentanglement, including in the target domain. We derive DiCyR both in the supervised learning case and in the UDA case.

This paper is organized as follows. Section 2 presents the required background on supervised learning and UDA, and proposes a definition of disentanglement for UDA. Section 3 reviews recent work in the literature that allows for a critical look at our contribution and puts it in perspective. Section 4 introduces DiCyR, first for the single-domain supervised learning case and then for the UDA problem.
Finally, Section 5 empirically evaluates DiCyR against state-of-the-art methods and discusses its strengths, weaknesses and variants. Section 6 summarizes and concludes this paper.

2. BACKGROUND

In this section, we introduce the notations and background upon which we build the contributions of Section 4. Let X be an input space of descriptors and Y an output space of labels. A supervised learning problem is defined by a distribution p_s(x, y) over elements of X × Y. In what follows, p_s will be called the source distribution. One wishes to estimate a mapping f that minimizes a loss function of the form E_{(x,y)∼p_s}[l(f(x), y)]. The optimal estimator is denoted f* and one often writes the distribution P(y|x) as y ∼ f*(x) + η, where η captures the deviations between y and f*(x). Hence, one tries to learn f*. In practice, the loss can only be approximated using a finite set of samples {(x_i, y_i)}_{i=1}^n, all independently drawn from p_s, and f is a parametric function (such as a deep neural network) of the form y = f(x; θ).

Domain adaptation (DA) consists in considering a target distribution p_t over X × Y that differs from p_s, and in transferring the knowledge acquired by learning in the source domain (p_s) to the target domain (p_t). Specifically, unsupervised DA exploits the knowledge of a labelled training set {(x_i^s, y_i^s)}_{i=1}^n sampled according to p_s, and an unlabelled data set {x_i^t}_{i=1}^m sampled according to p_t. For instance, the source domain data could be a set of labelled photographs of faces, and the target domain data a set of unlabelled face photographs taken with a different camera under different exposure conditions. The problem consists in minimizing the target loss E_{(x,y)∼p_t}[l(f(x), y)]. We posit that a necessary condition to benefit from the knowledge available in the source domain and transfer it to the target domain is the existence of a common information manifold between domains, onto which an input's projection is sufficient to predict the labels.
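The single-domain empirical loss minimization described above can be illustrated on a toy problem. The sketch below (a hypothetical example, not the paper's setup) draws n samples from a synthetic source distribution y ∼ f*(x) + η with f*(x) = 2x − 1, and fits a parametric model f(x; θ) by gradient descent on the empirical mean of the squared loss:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy source distribution p_s: y = f*(x) + eta, with f*(x) = 2x - 1
# and eta a small Gaussian noise term.
n = 200
x = rng.uniform(-1.0, 1.0, size=n)
y = 2.0 * x - 1.0 + 0.05 * rng.normal(size=n)

# Parametric model f(x; theta) = theta[0] * x + theta[1].
theta = np.zeros(2)

def empirical_loss(theta):
    """Empirical approximation of E_{(x,y)~p_s}[l(f(x), y)] with squared loss."""
    pred = theta[0] * x + theta[1]
    return np.mean((pred - y) ** 2)

# Plain gradient descent on the empirical loss.
lr = 0.5
for _ in range(500):
    pred = theta[0] * x + theta[1]
    grad = np.array([np.mean(2 * (pred - y) * x),
                     np.mean(2 * (pred - y))])
    theta -= lr * grad

print(theta)  # close to the true parameters (2, -1)
```

With enough samples, the minimizer of the empirical loss approaches f*; the domain shift problem arises precisely because this guarantee only holds for the distribution the samples were drawn from.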
We call this useful information task-specific or task-related. The complementary information will be called task-orthogonal; it is composed of information that is present in the input but is not relevant to the task at hand. For the sake of naming simplicity, we will call this information style. However, we insist that this should not be confused with the classical notion of style.

Let Π_τ : X → T and Π_σ : X → S denote two projection operators, where T and S denote respectively the latent task-related information space and the latent style-related information space. Let Π be the joint projection Π(x) = (Π_τ(x), Π_σ(x)). Conversely, we shall denote by Π̄ : T × S → X a reconstruction operator. Finally, c : T → Y will denote the labeling operator, which only uses information from T. We consider that the information of the elements of X is correctly disentangled by Π = (Π_τ, Π_σ) if one can find Π̄ and c such that:

C1: c ∘ Π_τ minimizes the loss (and thus fits f* on the appropriate domain);
C2: Π̄ ∘ Π fits the identity operator id_X;
C3: with X the random variable in X, and T = Π_τ(X), S = Π_σ(X) the induced random variables in T and S, the mutual information I(T; S) = 0;
C4: there is no function g : T → X such that g ∘ Π_τ = id_X.

Condition C1 imposes that the projection into T retains enough information to correctly label samples. Condition C2 imposes that all the information necessary for the reconstruction is preserved by the separation performed by Π. Condition C3 states that no information is present in both T and S. Condition C4 imposes that the information contained in T alone is insufficient to reconstruct an input, and thus that the information of S is necessary. Note that the symmetrical condition is unnecessary, since the combination of C1 and C3 already guarantees that S cannot contain the task-related information. Overall, solving this disentanglement problem for DA amounts to finding a quadruplet (Π_τ, Π_σ, Π̄, c) that meets the conditions above.
In particular, note that conditions C3 and C4 open a perspective towards a formulation of disentanglement in the general case.
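To make the conditions above concrete, the sketch below shows how C1–C3 can translate into training losses on a batch of source data. All module names and sizes (task_proj, style_proj, decoder, classifier) are illustrative assumptions, as is the cross-covariance penalty used as a simple surrogate for C3; none of this is the paper's exact architecture or independence estimator:

```python
import torch
import torch.nn as nn

d_in, d_task, d_style, n_classes = 16, 4, 4, 3

task_proj  = nn.Linear(d_in, d_task)             # Pi_tau : X -> T
style_proj = nn.Linear(d_in, d_style)            # Pi_sigma : X -> S
decoder    = nn.Linear(d_task + d_style, d_in)   # Pi_bar : T x S -> X
classifier = nn.Linear(d_task, n_classes)        # c : T -> Y

# A batch of (hypothetical) labelled source samples.
x = torch.randn(8, d_in)
y = torch.randint(0, n_classes, (8,))

t, s = task_proj(x), style_proj(x)

# C1: c o Pi_tau should fit the labels (classification loss on source data).
loss_task = nn.functional.cross_entropy(classifier(t), y)

# C2: Pi_bar o Pi should fit id_X (reconstruction loss).
x_hat = decoder(torch.cat([t, s], dim=1))
loss_rec = nn.functional.mse_loss(x_hat, x)

# C3: no shared information between T and S. As a crude surrogate
# (an assumption here, not the paper's estimator), penalise the
# empirical cross-covariance between the two representations.
t_c = t - t.mean(dim=0)
s_c = s - s.mean(dim=0)
loss_indep = (t_c.T @ s_c / len(x)).pow(2).sum()

# C4 is enforced architecturally rather than by a loss: the decoder
# only ever sees T jointly with S, so T alone cannot reconstruct x.
loss = loss_task + loss_rec + loss_indep
loss.backward()
```

Minimizing such a combined objective jointly fits (Π_τ, Π_σ, Π̄, c); the specific losses DiCyR uses for disentanglement are introduced in Section 4.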

3. RELATED WORK

Disentanglement between the domain-invariant, task-related information and the domain-specific, task-orthogonal style information is a desirable property for DA. In the next paragraphs, we cover important work in representation disentanglement, domain adaptation, and their interplay. Before deep learning became popular, Tenenbaum & Freeman (2000) presented a method using bilinear models able to separate style from content. More recently, methods based on generative

