DISENTANGLED CYCLIC RECONSTRUCTION FOR DOMAIN ADAPTATION

Abstract

The domain adaptation problem involves learning a unique classification or regression model capable of performing on both a source and a target domain. Although the labels for the source data are available during training, the labels in the target domain are unknown. An effective way to tackle this problem lies in extracting features invariant across the source and target domains. In this work, we propose splitting the information for each domain into a task-related representation and its complementary context representation. We propose an original method to disentangle these two representations in the single-domain supervised case. We then adapt this method to the unsupervised domain adaptation problem. In particular, our method allows disentanglement in the target domain, despite the absence of training labels. This enables the isolation of task-specific information from both domains and a projection into a common representation. The task-specific representation allows efficient transfer of the knowledge acquired in the source domain to the target domain. We validate the proposed method on several classical domain adaptation benchmarks and illustrate the benefits of disentanglement for domain adaptation.

1. INTRODUCTION

The wide adoption of Deep Neural Networks in practical supervised learning applications is hindered by their sensitivity to the training data distribution. This problem, known as domain shift, can drastically weaken, in real-life operating conditions, the performance of a model that seemed perfectly efficient in simulation. Learning a model with the goal of making it robust to a specific domain shift is called domain adaptation (DA). Often, the data available to achieve DA consist of a labeled training set from a source domain and an unlabeled sample set from a target domain. This yields the problem of unsupervised domain adaptation (UDA). In this work, we take an information disentanglement perspective on UDA. We argue that a key to efficient UDA lies in separating the information necessary to complete the network's task (classification or regression) from a task-orthogonal information which we call context or style. Disentanglement in the target domain seems, however, a difficult endeavor since the available data is unlabeled. Our contribution is twofold. First, we propose a formal definition of the disentanglement problem for UDA which, to the best of our knowledge, is new. Second, we design a new learning method, called DiCyR (Disentangled Cyclic Reconstruction), which relies on cyclic reconstruction of inputs to achieve efficient disentanglement, including in the target domain. We derive DiCyR both for supervised learning and for UDA. This paper is organized as follows. Section 2 presents the required background on supervised learning and UDA, and proposes a definition of disentanglement for UDA. Section 3 reviews recent work that allows a critical look at our contribution and puts it in perspective. Section 4 introduces DiCyR, first for the single-domain supervised learning case and then for the UDA problem.
Finally, Section 5 empirically evaluates DiCyR against state-of-the-art methods and discusses its strengths, weaknesses and variants. Section 6 summarizes and concludes this paper.

2. PROBLEM DEFINITION

In this section, we introduce the notations and background upon which we build the contributions of Section 4. Let X be an input space of descriptors and Y an output space of labels. A supervised learning problem is defined by a distribution p_s(x, y) over elements of X × Y. In what follows, p_s will be called the source distribution. One wishes to estimate a mapping f that minimizes a loss function of the form E_{(x,y)∼p_s}[l(f(x), y)]. The optimal estimator is denoted f* and one often writes the distribution P(y|x) as y ∼ f*(x) + η, where η captures the deviations between y and f*(x). Hence, one tries to learn f*. In practice, the loss can only be approximated using a finite set of samples {(x_i, y_i)}_{i=1}^{n}, all independently drawn from p_s, and f is a parametric function (such as a deep neural network) of the form y = f(x; θ). Domain adaptation (DA) consists in considering a target distribution p_t over X × Y that differs from p_s, and the transfer of knowledge from learning in the source domain (p_s) to the target domain (p_t). Specifically, unsupervised DA exploits the knowledge of a labelled training set {(x^s_i, y^s_i)}_{i=1}^{n} sampled according to p_s, and an unlabelled data set {x^t_i}_{i=1}^{m} sampled according to p_t. For instance, the source domain data could be a set of labelled photographs of faces, and the target domain data a set of unlabelled face photographs taken with a different camera under different exposure conditions. The problem consists in minimizing the target loss E_{(x,y)∼p_t}[l(f(x), y)]. We suppose that a necessary condition to benefit from the knowledge available in the source domain and transfer it to the target domain is the existence of a common information manifold between domains, where an input's projection is sufficient to predict the labels. We call this useful information task-specific or task-related.
The complementary information will be called task-orthogonal; it is composed of information that is present in the input but is not relevant to the task at hand. For the sake of naming simplicity, we will call this information style, although we insist that it should not be confused with the classical notion of style. Let Π_τ : X → T and Π_σ : X → S denote two projection operators, where T and S denote respectively the latent task-related information space and the latent style-related information space. Let Π be the joint projection Π(x) = (Π_τ(x), Π_σ(x)). Conversely, we shall note Π̄ : T × S → X a reconstruction operator. Finally, c : T → Y will denote the labeling operator, which only uses information from T. We consider that the information of the elements of X is correctly disentangled by Π = (Π_τ, Π_σ) if one can find Π̄ and c such that:
C1: c ∘ Π_τ minimizes the loss (and thus fits f* on the appropriate domain);
C2: Π̄ ∘ Π fits the identity operator id_X;
C3: with X, T, S the random variables over X, T, S, the mutual information I(T; S | X) = 0;
C4: there is no function g : T → X such that g ∘ Π_τ = id_X.
Condition C1 imposes that the projection into T retains enough information to correctly label samples. Condition C2 imposes that all the information necessary for the reconstruction is preserved by the separation performed by Π. Condition C3 states that no information is present in both T and S. Condition C4 imposes that the information contained in T alone is insufficient to reconstruct an input, and thus that the information of S is necessary. Note that the symmetrical condition is unnecessary, since the combination of C1 and C3 already guarantees that S cannot contain the task-related information. Overall, solving this disentanglement problem for DA implies finding a quadruplet (Π_τ, Π_σ, Π̄, c) that meets the conditions above. In particular, note that conditions C3 and C4 open a perspective towards a formulation of disentanglement in the general case.
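To make conditions C1–C4 concrete, here is a deliberately trivial sketch in Python where the input space is, by construction, a product of a task part and a style part. The function names mirror Π and Π̄, and the example values are purely illustrative; real encoders would of course be learned.

```python
# Toy illustration of the disentanglement operators: inputs are pairs
# x = (task_part, style_part); Pi splits them and Pi_bar reassembles.
# This construction trivially satisfies C2 (Pi_bar after Pi = identity).

def Pi(x):
    """Joint projection: Pi(x) = (Pi_tau(x), Pi_sigma(x))."""
    task, style = x
    return task, style

def Pi_bar(task, style):
    """Reconstruction operator Pi_bar: T x S -> X."""
    return (task, style)

x = ("digit=7", "color=red")
tau, sigma = Pi(x)
assert Pi_bar(tau, sigma) == x   # C2: the reconstruction is exact
# C4 asks that tau alone is NOT enough to rebuild x: here the style
# "color=red" is absent from tau, so no g(tau) can recover x exactly.
assert sigma not in tau
```

In this toy setting C3 holds by construction (the two parts carry disjoint symbols); the learning problem of Section 4 is precisely to obtain such a split when the factorization of X is not given.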

3. RELATED WORK

Disentanglement between the domain-invariant, task-related information and the domain-specific, task-orthogonal style information is a desirable property for DA. In the next paragraphs, we cover important work in representation disentanglement, domain adaptation, and their interplay. Before deep learning became popular, Tenenbaum & Freeman (2000) presented a method using bilinear models able to separate style from content. More recently, methods based on generative models have demonstrated the ability to disentangle factors of variation from elements of a single domain (Rifai et al., 2012; Mathieu et al., 2016; Chen et al., 2016; Higgins et al., 2017; Sanchez et al., 2019). In a cross-domain setting, Gonzalez-Garcia et al. (2018) use pairs of images with the same labels from different domains to separate representations into a shared information common to both domains and a domain-exclusive information. We note that these approaches do not explicitly aim at respecting all conditions listed in Section 2. Additionally, most require labeled datasets (and in some cases even paired datasets) and thus do not address the unsupervised DA problem. One approach to UDA consists in aligning the source and target distribution statistics, a topic closely related to batch normalization (Ioffe & Szegedy, 2015). Sun et al. (2017) minimize the distance between the covariance matrices of the features extracted from the source and target domains. Assuming the domain-specific information is contained inside the batch normalization layers, Li et al. (2017) align the batch statistics by adopting a specific normalization for each domain. Cariucci et al. (2017) aim to align source and target feature distributions to a reference one and introduce domain alignment layers to automatically learn the degree of feature alignment needed at different levels of the network. Similarly, Roy et al.
(2019) replace batch normalization layers with domain alignment layers implementing a so-called feature whitening. A major asset of these methods is that they can be used jointly with other DA methods (including the one we propose in Section 4). These methods jointly learn a common representation for elements from both domains. Conversely, Liang et al. (2020) freeze the representations learned in the source domain before training a target-specific encoder that aligns the representations of the target elements by maximizing the mutual information between intermediate feature representations and the outputs of the classifier. Ensemble methods have also been applied to UDA (Laine & Aila, 2017; Tarvainen & Valpola, 2017). French et al. (2018) combine stochastic data augmentation with self-ensembling to minimize the prediction differences between a student and a teacher network in the target domain. Another approach involves learning domain-invariant features, which do not allow discriminating whether a sample belongs to the source or the target domain, while still permitting accurate labeling in the source domain. This approach relies on the assumption that such features allow efficient labeling in the target domain. Ghifary et al. (2016) build a two-headed network sharing common layers; one head performs classification in the source domain, while the second is a decoder that performs reconstruction for target domain elements. Ganin et al. (2016) propose the DANN method and introduce Gradient Reversal Layers to connect a domain discriminator and a feature extractor. These layers invert the gradient sign during back-propagation so that the feature extractor is trained to fool the domain discriminator. Shen et al. (2018) modify DANN and replace the domain discriminator by a network that approximates the Wasserstein distance between domains. Tzeng et al. (2017) optimize, in an adversarial setting, a generator and a discriminator with an inverted label loss.
Other methods focus on explicitly disentangling an information shared between domains (analogous to the domain-invariant features above) from a domain-specific information. Inspired by Chen et al. (2016), Liu et al. (2018b) isolate a latent factor, representing the domain information, from the rest of an encoding, by maximizing the mutual information between generated images and this latent factor. Some domain information may still be present in the remaining part of the encoding, which thus may not comply with conditions C3 and C4. Liu et al. (2018a) combine an encoder, an image generator, a domain discriminator, and a fake images discriminator to produce cross-domain images. The encoder is trained jointly with the domain discriminator to produce domain-invariant representations. Li et al. (2020) disentangle a latent representation into a global code and a local code. The global code captures category information via an encoder with a prior, while the local code, which captures the style-related information via an implicit decoder, is transferable across domains. Bousmalis et al. (2016) also produce domain-invariant features by training a shared encoder to fool a domain discriminator. They train two domain-private encoders with a difference loss that encourages orthogonality between the shared and the private representations (similarly to condition C3). Cao et al. (2018); Cai et al. (2019); Peng et al. (2019) combine a domain discriminator with an adversarial classifier to separate the information shared between domains from the domain-specific information. All these methods build a shared representation that prevents discriminating between source and target domains, while retaining enough information to correctly label samples from the source domain.
However, because they rely on an adversarial classifier that requires labeled data, they do not guarantee that the complementary, domain-specific information for samples in the target domain does not overlap with the shared representation. In other words, they only enforce C3 in the source domain. They rely on the assumption that the disentanglement will still hold when applied to target domain elements, which might not be true. Another identified weakness of methods that achieve a domain-invariant feature space is that their representations might not allow for accurate labeling in the target domain. Indeed, feature alignment does not necessarily imply a correct mapping between domains. To illustrate this point, consider a binary classification problem (classes c1 and c2) and two domains (d1 and d2). Let (c1, d1) denote samples of class c1 in d1. It is possible to construct an encoding that projects (c1, d1) and (c2, d2) to the same feature values; the same holds for (c1, d2) and (c2, d1), for different feature values. This encoding allows discriminating between classes in d1. It also fools a domain discriminator, since it does not allow predicting the original domain of a projected element. However, applying the classification function learned on d1 to the projected d2 elements leads to catastrophic predictions. Transforming a sample from one domain to the other while retaining its label information can be accomplished by image-to-image translation methods. Hoffman et al. (2018) extend CycleGAN's cycle consistency (Zhu et al., 2017) with a semantic consistency to translate from source to target domains. The images translated from the source domain to the target domain are then used to train a classifier on the target domain using the source labels. Similarly, Russo et al.
(2018) train two conditional GANs (Mirza & Osindero, 2014) to learn bi-directional image mappings constrained by a class consistency loss, and use a source domain classifier to produce pseudo-labels on source-like transformed target samples. By relaxing CycleGAN's cycle consistency constraint and integrating the discriminator in the training phase, Hosseini-Asl et al. (2019) address the DA problem in the specific setting where the number of target samples is limited. Takahashi et al. (2020) use a CycleGAN to generate cross-domain pseudo-pairs and train two domain-specific encoders to align the features extracted from each pseudo-pair in the feature space. A major asset of this method is that it addresses the class-unbalanced UDA problem by oversampling with the learned data augmentation. Yang et al. (2019) use separate encoders to produce domain-invariant and domain-specific features in both domains. They jointly train these encoders with two generators to produce cross-domain elements able to fool domain-specific discriminators. Using a cyclic loss on features, they force the information contained in the representation to be preserved during the generation of cross-domain elements. However, the cyclic loss on features does not prevent the information sharing between features expressed in C3. More importantly, it does not prevent the domain-specific features from being constant. A major drawback of these methods lies in the training instability that may be caused by the min-max optimization problem induced by the adversarial training of generators and discriminators. In the next section, we introduce a method that does not rely on a domain discriminator or an adversarial label predictor, but directly minimizes the information shared between representations. This guarantees that there is no information redundancy between the task-related and the task-orthogonal style information in either the source or the target domain.
Along the way, it provides an efficient mechanism to disentangle the task-related information from the style information in the single-domain case. Our method combines information disentanglement with intra-domain and cross-domain cyclic consistency to enforce a more principled mapping between the two domains.
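The feature-alignment counterexample discussed above can be made concrete with a few lines of Python. The class and domain names and the lookup-table "encoding" are illustrative assumptions, not part of any method discussed in this section.

```python
# Concrete version of the counterexample: a feature map that aligns the
# two domains perfectly yet flips the classes in the target domain.
# Samples are (class, domain) pairs; the encoding is a fixed lookup.

encode = {("c1", "d1"): 0, ("c2", "d2"): 0,
          ("c1", "d2"): 1, ("c2", "d1"): 1}

# Each feature value occurs exactly once per domain, so a domain
# discriminator reading the feature can do no better than chance.
for feature in (0, 1):
    domains = {d for (c, d), f in encode.items() if f == feature}
    assert domains == {"d1", "d2"}

# A classifier fitted on d1 maps feature 0 -> c1 and feature 1 -> c2.
classify = {0: "c1", 1: "c2"}

src_acc = sum(classify[encode[(c, "d1")]] == c for c in ("c1", "c2")) / 2
tgt_acc = sum(classify[encode[(c, "d2")]] == c for c in ("c1", "c2")) / 2
print(src_acc, tgt_acc)  # 1.0 0.0
```

The encoding is perfectly domain-invariant and perfectly discriminative on the source domain, yet every target prediction is wrong: alignment alone does not guarantee a correct cross-domain mapping.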

4. DISENTANGLED CYCLIC RECONSTRUCTION

First, we propose an original method to disentangle the task-related information from the style information for a single domain in a supervised learning setting. In a second step, we propose an adaptation of this method to learn these disentangled representations in both domains for UDA. This disentanglement allows, in turn, to efficiently predict labels in the target domain.

4.1. TASK-STYLE DISENTANGLEMENT IN THE SUPERVISED CASE

Our approach consists in estimating jointly Π, Π̄ and c as a deep feed-forward neural network. We shall note θ_Π, θ_Π̄, and θ_c the parameters of the respective sub-parts of the network. Π̄ ∘ Π takes the form of an auto-encoder, while c ∘ Π_τ is a task-related (classification or regression) network. Figure 1a summarizes the global architecture, which we detail in the following paragraphs. In order to achieve condition C3, we exploit Gradient Reversal Layers (Ganin et al., 2016, GRL). We train two side networks r_τ : S → T and r_σ : T → S whose purpose is to attempt to predict T given S, and S given T, respectively. For a given x, let us write (τ, σ) = Π(x), τ̂ = r_τ(σ), and σ̂ = r_σ(τ). We train r_τ and r_σ to minimize the losses L_rτ = ||τ̂ − τ||² and L_rσ = ||σ̂ − σ||². Let L_info = L_rτ + L_rσ denote the combination of these losses. We connect these two sub-networks to the whole architecture using GRLs. GRLs behave as the identity function during the forward pass and invert the gradient sign during the backward pass, hence pushing the parameters to maximize the output loss. During training, this architecture constrains Π to produce features in T and S with the least information shared between them. Consequently, the update of θ_Π follows +∇_{θ_Π} L_info. This constraint efficiently avoids information redundancy between T and S. However, it does not prevent all the information from being pushed into T. Preventing this undesirable behavior is the purpose of condition C4. To that end, we use a cyclic reconstruction scheme. Consider two elements x and x′ from X, and their associated (τ, σ) = Π(x) and (τ′, σ′) = Π(x′). Let x̃ = Π̄(τ, σ′) be the reconstruction that combines the task information τ of x with the style σ′ of x′. A correct allotment of the information between T and S requires that the task and style information be preserved in (τ̃, σ̃) = Π(x̃). So, we wish to have τ̃ as close as possible to τ, or, alternatively, to have c(τ̃) as close as possible to c(τ).
Similarly, we wish to have σ̃ as close as possible to σ′. To enforce C4 and avoid the degenerate case where the encoder predicts a constant style for all samples, we force σ̃ to lie sufficiently far from σ to avoid style confusion. We achieve this with a triplet loss (Schroff et al., 2015) using σ̃ as the anchor, σ′ and σ as, respectively, the positive and negative inputs, and a margin m. Thus C4 results in minimizing the cyclic reconstruction loss L_cyclic = ||τ̃ − τ||² + max{||σ̃ − σ′||² − ||σ̃ − σ||² + m, 0}. Overall, the gradient-based update procedure of the network parameters boils down to:
θ_Π ← θ_Π − α∇_{θ_Π}(L_task + L_reco − L_info + L_cyclic),
θ_Π̄ ← θ_Π̄ − α∇_{θ_Π̄}(L_reco + L_cyclic),
θ_c ← θ_c − α∇_{θ_c} L_task,
θ_rτ ← θ_rτ − α∇_{θ_rτ} L_rτ,
θ_rσ ← θ_rσ − α∇_{θ_rσ} L_rσ.
We call this method DiCyR, for Disentangled Cyclic Reconstruction.
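As a minimal sketch of the two mechanisms above, the snippet below implements a gradient reversal layer with an explicit backward method (no autograd framework assumed) and the cyclic loss with its triplet term. All vectors are plain Python lists and the numeric values are illustrative only.

```python
# Sketch of the disentanglement machinery: a gradient reversal layer
# (identity forward, negated gradient backward) and the cyclic loss
# L_cyclic = ||tau~ - tau||^2 + max{||sig~ - sig'||^2 - ||sig~ - sig||^2 + m, 0}.

class GradientReversal:
    """Identity in the forward pass; multiplies gradients by -lam in
    the backward pass, pushing the encoder to MAXIMIZE L_info."""
    def __init__(self, lam=1.0):
        self.lam = lam
    def forward(self, x):
        return x
    def backward(self, grad):
        return [-self.lam * g for g in grad]

def sq_dist(a, b):
    return sum((u - v) ** 2 for u, v in zip(a, b))

def cyclic_loss(tau, tau_cyc, sigma, sigma_prime, sigma_cyc, m=1.0):
    """Task-consistency term plus a triplet term on styles, with
    sigma_cyc as anchor, sigma_prime positive, sigma negative."""
    task_term = sq_dist(tau_cyc, tau)
    triplet = max(sq_dist(sigma_cyc, sigma_prime)
                  - sq_dist(sigma_cyc, sigma) + m, 0.0)
    return task_term + triplet

# GRL: forward is the identity, backward flips (and scales) the gradient.
grl = GradientReversal(lam=2.0)
assert grl.forward([0.5, -1.0]) == [0.5, -1.0]
assert grl.backward([0.1, -0.3]) == [-0.2, 0.6]

# Perfect disentanglement drives the cyclic loss to zero: the task code
# and the swapped-in style survive the cycle, and the styles are far apart.
tau, sigma, sigma_prime = [1.0, 0.0], [0.0, 0.0], [3.0, 0.0]
print(cyclic_loss(tau, tau, sigma, sigma_prime, sigma_prime, m=1.0))  # 0.0
```

In a real implementation the GRL would be a custom autograd operation placed between the encoder and the side networks r_τ, r_σ, so that a single backward pass produces the +∇ L_info update on θ_Π.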

4.2. TASK-STYLE DISENTANGLEMENT IN THE UNSUPERVISED DOMAIN ADAPTATION CASE

We propose a variation of DiCyR for UDA, where we replace the decoder Π̄ by two domain-specific decoders, Π̄_s and Π̄_t. We compensate for the lack of labeled data in the target domain by computing cross-domain cyclic reconstructions. Let (x_s, y_s) be a sample from the source domain and x_t a sample from the target domain. Let us denote (τ_s, σ_s) = Π(x_s) and (τ_t, σ_t) = Π(x_t) the corresponding projections into the latent task and style-related information spaces. Then one can define, as in the previous section, L_task as the task-specific loss on the source domain, and L_reco_s and L_reco_t as the reconstruction losses in the source and target domains respectively. As previously, we constrain the task-related representation and the style representation not to share information using two networks r_τ and r_σ, connected to the main architecture by GRL layers (Figure 1b), allowing the definition of the L_rτ, L_rσ and L_info losses. Lastly, we exploit cyclic reconstructions in both domains to correctly disentangle the information and hence define the same L_cyclic loss as above. This disentanglement in the target domain separates the global information in two, but does not guarantee that what is being pushed into τ is really the task-related information. This can only be enforced by cross-domain knowledge (since no correct labels are available in the target domain). Thus, finally, we would like to allow projections from one domain into the other while retaining the task-related information, hence allowing domain adaptation. Using the notations above, we construct x_ts = Π̄_t(τ_s, σ_t), the reconstruction of x_s's task-related information in the style of x_t. This creates an artificial sample in the target domain, whose label is y_s.
Then, with (τ_ts, σ_ts) = Π(x_ts), one wishes to have τ_ts match closely τ_s (or, alternatively, c(τ_ts) match closely y_s) in order to prevent the loss of task information during the cross-domain projection, and thus to constrain the task representations to be domain-invariant. Symmetrically, one can construct the artificial sample x_st = Π̄_s(τ_t, σ_s) and enforce that τ_st closely matches τ_t. Note that the label of x_st is unknown, and yet it is still possible to enforce the disentanglement by cyclic reconstruction. Overall, these terms boil down to a cross-domain cyclic reconstruction loss for UDA: L_domain_cyclic = ||τ_s − τ_ts||² + ||τ_t − τ_st||². Finally, the network parameters are updated according to:
θ_Π ← θ_Π − α∇_{θ_Π}(L_task + L_reco_s + L_reco_t − L_info + L_cyclic + L_domain_cyclic),
θ_Π̄s ← θ_Π̄s − α∇_{θ_Π̄s}(L_reco_s + L_cyclic),
θ_Π̄t ← θ_Π̄t − α∇_{θ_Π̄t}(L_reco_t + L_cyclic),
θ_c ← θ_c − α∇_{θ_c} L_task,
θ_rτ ← θ_rτ − α∇_{θ_rτ} L_rτ,
θ_rσ ← θ_rσ − α∇_{θ_rσ} L_rσ.
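The cross-domain cycle can be sketched as follows. The toy encoder and decoder, which simply split and concatenate coordinates, stand in for Π and for both Π̄_s and Π̄_t; they are assumptions made purely for illustration.

```python
# Sketch of the cross-domain cyclic term: project a source sample into
# the target style, re-encode it, and penalize any drift of the task
# code (and symmetrically for the target sample).

def encode(x):               # stands in for Pi: first half task, second style
    h = len(x) // 2
    return x[:h], x[h:]

def decode(task, style):     # stands in for both Pi_bar_s and Pi_bar_t
    return task + style

def sq_dist(a, b):
    return sum((u - v) ** 2 for u, v in zip(a, b))

x_s, x_t = [1.0, 2.0, 0.1, 0.1], [1.0, 2.0, 0.9, 0.9]
tau_s, sig_s = encode(x_s)
tau_t, sig_t = encode(x_t)

x_ts = decode(tau_s, sig_t)  # source content rendered in the target style
x_st = decode(tau_t, sig_s)  # target content rendered in the source style
tau_ts, _ = encode(x_ts)
tau_st, _ = encode(x_st)

L_domain_cyclic = sq_dist(tau_s, tau_ts) + sq_dist(tau_t, tau_st)
print(L_domain_cyclic)  # 0.0
```

The loss is zero here because the toy encoder preserves task codes exactly through the cycle; with learned networks, minimizing this term is what forces that preservation.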

5. EXPERIMENTAL RESULTS AND DISCUSSION

We first evaluate DiCyR's ability to disentangle the task-related information from the style information in the supervised context. Then we demonstrate DiCyR's efficiency on UDA.

5.1. SUPERVISED DISENTANGLEMENT

We evaluate the disentanglement performance of DiCyR by following the protocol introduced by Mathieu et al. (2016). Since we do not use generative models, we only focus on their first two items: swapping and retrieval. We evaluate DiCyR on the SVHN (Netzer et al., 2011) and 3D Shapes (Burgess & Kim, 2018) disentanglement benchmarks. The task is predicting the central digit in the image for the SVHN dataset, and the shape of the object in the scene for the 3D Shapes dataset. Swapping involves swapping styles between samples and visually assessing the realism of the generated image. It combines the task-related information τ_i of a sample x_i with the style σ_j of another sample x_j. We use the decoder to produce an output x̃_ij. Figure 2 shows randomly generated outputs on the two datasets. DiCyR produces visually realistic artificial images with the desired styles. Retrieval concerns finding, in the dataset, the nearest neighbors in the embedding space for an image query. We carry out this search for nearest neighbors using the Euclidean distance on both the task-related and the style representations. A good indicator of the effectiveness of the information disentanglement is to observe neighbors with the same labels as the query when computing distances on the task-related information space, and neighbors with similar style when using the style information. Figure 3 demonstrates that the neighbors found when using the task-related information are samples with the same label as the query's, and that the neighbors found using the style representation share many characteristics with the query but not necessarily the same labels. We ran a quantitative evaluation of disentanglement by training a neural network classifier with a single hidden layer of 32 units to predict labels, using either the task-related information alone or the style information alone.
If the information is correctly separated, we expect the classifier trained with task-related information only to reach a performance similar to a classifier trained with full information. Conversely, the classifier trained with the style information only should reach a performance similar to a random guess (10% accuracy on SVHN, 25% on 3D Shapes). Table 1 reports the obtained testing accuracies. It appears that the task-related representation contains enough information to correctly predict labels. We also observe that full disentanglement is closely but not perfectly achieved, as the classifier trained only with style information performs slightly better than random choice. To quantify how much style information is unduly encoded in the task-related representation, we ran a similar experiment to predict the five other style variation factors in 3D Shapes (floor hue, wall hue, object hue, scale and orientation). The trained classifier reaches accuracies that are very close to a random guess, thus validating the disentanglement quality.
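The retrieval protocol used above reduces to a nearest-neighbour search under the Euclidean distance. A minimal sketch, with toy 2-D embeddings standing in for the outputs of Π_τ or Π_σ:

```python
# Nearest-neighbour retrieval in an embedding space with the Euclidean
# distance, as used for the qualitative evaluation. Embeddings are toy
# 2-D vectors; in practice they come from Pi_tau or Pi_sigma.

def retrieve(query, database, k=2):
    """Return the indices of the k database entries closest to query."""
    d2 = lambda a, b: sum((u - v) ** 2 for u, v in zip(a, b))
    ranked = sorted(range(len(database)), key=lambda i: d2(query, database[i]))
    return ranked[:k]

database = [[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]]
print(retrieve([0.02, 0.0], database, k=2))  # [0, 1]
```

Running the same search twice, once on the task embedding and once on the style embedding, is exactly what produces the two kinds of neighbours compared in Figure 3.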

5.2. UNSUPERVISED DOMAIN ADAPTATION PROBLEM

We evaluate DiCyR by performing domain adaptation between the MNIST (LeCun et al., 1998), SVHN, and USPS (Hull, 1994) datasets, and between the Syn-Signs (Ganin & Lempitsky, 2015) and GTSRB (Stallkamp et al., 2011) datasets. Following common practice in the literature, we trained our network on five different settings: MNIST→USPS, USPS→MNIST, SVHN→MNIST, MNIST→SVHN, and Syn-Signs→GTSRB. We measure the classification performance in the target domain and compare it with state-of-the-art methods (Table 2). We also compare with a baseline classifier that is only trained on the source domain data. Values reported in Table 2 are quoted from their original papers. Our method, without extensive hyperparameter tuning, appears to be on par with the best state-of-the-art methods. DiCyR also outperforms other disentanglement and image-to-image translation methods. Specifically, DiCyR is only slightly outmatched by DWT and SEDA on the MNIST↔USPS benchmarks, and by SEDA and SHOT on the SVHN→MNIST benchmark. The MNIST→SVHN case is particularly challenging since MNIST images are greyscale and the adaptation to SVHN requires adapting to color images. SEDA makes extensive use of data augmentation to tackle this challenge and is thus the only method displaying convincing results. This hints at a possible enhancement of DiCyR in order to improve its performance. Finally, by introducing a new variation on batch normalization, DWT's contribution is orthogonal to ours and both could be combined. We emphasize that, beyond these enhancements, a major advantage of DiCyR lies in its ability to disentangle the information in the target domain without direct supervision. DiCyR uses GRLs to ensure that no information is shared between T and S. Similarly to most methods in Table 2, GRLs induce an adversarial optimization problem, which is known to yield instability and variance in the resulting performance. In our case, this induces several distinct modes in the distribution of accuracies.
It is interesting to note that the majority mode (that of the median) on SVHN→MNIST matches the best known performance. For this reason, we report both the mean and the median on this specific experiment. One might object that condition C3 was expressed in terms of mutual information, and that DiCyR only indirectly implements this condition using GRLs. An alternative could be to use an estimator of the mutual information, such as the one proposed by Belghazi et al. (2018), to directly minimize it (and thus avoid the adversarial setting altogether). Such an approach was explored in the work of Sanchez et al. (2019).

Method                           MNIST→USPS  USPS→MNIST  SVHN→MNIST  MNIST→SVHN  Syn-Signs→GTSRB
DANN (Ganin et al., 2016)              85.1        73.0        73.9        35.7             88.6
ADDA (Tzeng et al., 2017)              89.4        90.1        76.0           -                -
DSN (Bousmalis et al., 2016)           91.3           -        82.7           -             93.1
DRCN (Ghifary et al., 2016)            91.8        73.7        82.0        40.1                -
DiDA (Cao et al., 2018)                92.5           -        83.6           -                -
SBADA-GAN (Russo et al., 2018)         97.6        95.0        76.1        61.1             96.7
CyCADA (Hoffman et al., 2018)          95.6        96.5        90.4           -                -
DWT (Roy et al., 2019)                 99.1        98.8        97.7        28.9                -
SEDA (French et al., 2018)             98.2        99.5        99.3        97.0             99.3
ACAL (Hosseini-Asl et al., 2019)       98.3        97.2        96.5        60.8                -
SHOT (Liang et al., 2020)              98.4        98.0        98.9           -                -
DiCyR (ours)                           98.4        98.3        98.5¹       23.8             97.4

¹ Median accuracy reported (average accuracy: 95.7; the full distribution of results is reported in Appendix F).

Table 2: Target domain accuracy, reported as percentages.

As in Section 5.1, we qualitatively evaluate the effectiveness of disentanglement, especially in the target domain, and produce visualizations of cross-domain style and task swapping. Here, we combine one domain's task information with the other domain's styles to reconstruct the images of Figures 4a, 4b, 4d, and 4e. The most important finding is that the style information was correctly disentangled from the task-related information in the target domain without the use of any label.
Specifically, the rows in these figures show that the class information is preserved when a new style is applied, while the columns illustrate the efficient style transfer allowed by disentanglement. A desirable property of the task-related encoding is its domain invariance. To evaluate this aspect, we built a t-SNE representation (Hinton & Roweis, 2003) of the task-related features, in order to verify their alignment between domains. Figures 4c and 4f demonstrate this property. The previous experiments illustrated the use of DiCyR in the context of image classification. The method is, however, quite generic and can be applied in many more contexts. Figure 5 reports the improvement due to applying DiCyR for domain adaptation between the GTA5 (Richter et al., 2016) and Cityscapes (Cordts et al., 2016) segmentation problems (detailed results in Appendix G). Finally, directly computing the distances on the task-related features in L_domain_cyclic often leads to unstable results. As hinted in Section 4, using instead a task-oriented loss L_domain_cyclic = ||c(τ_ts) − y_s||² + ||c(τ_t) − c(τ_st)||² stabilizes training and improves the target domain accuracy. Training c with cross-domain projections from the source domain and the corresponding labels improves its generalization to the target domain and forces the encoder to produce task-related features common to both domains. To illustrate this property, consider the following example. In one domain, the digit "7" is written with a middle bar, while in the other it has none. This domain-specific middle-bar feature should not be expressed in T; it should be considered as a task-orthogonal style feature. Thus, using c's predictions within the domain cyclic loss, instead of distances in T, prevents the encoder from representing the domain-specific features in T and encourages their embedding in S.
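A sketch of this task-oriented variant of the cross-domain cyclic loss, with a toy linear map standing in for the classifier c and hand-picked vectors chosen so that the cycle preserves predictions exactly (all names and values are illustrative assumptions):

```python
# Sketch of the prediction-space variant of the cross-domain cyclic
# loss: compare classifier outputs (and the known source label) rather
# than raw task features. c is a toy linear "classifier" placeholder.

def sq_dist(a, b):
    return sum((u - v) ** 2 for u, v in zip(a, b))

c = lambda tau: [2.0 * t for t in tau]   # placeholder classifier head

y_s = [2.0, 4.0]                         # known source-sample target
tau_ts = [1.0, 2.0]                      # re-encoded source-in-target-style
tau_t = [0.5, 0.5]                       # target task code
tau_st = [0.5, 0.5]                      # re-encoded target-in-source-style

loss = sq_dist(c(tau_ts), y_s) + sq_dist(c(tau_t), c(tau_st))
print(loss)  # 0.0
```

Because the penalty is applied after c, a domain-specific feature (such as the middle bar of a "7") can move freely inside T-space without cost only if it does not change the prediction, which is exactly the invariance argued for above.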

A CROSS-DOMAIN DISENTANGLEMENT VISUALIZATIONS

We report additional cross-domain visualizations, similar to those of Section 5.2, in Figures 6 and 7.

B NETWORK ARCHITECTURE AND TRAINING HYPER-PARAMETERS

The paragraphs below detail the network architectures used in the experiments of Section 5. It should be noted that neither these architectures nor the associated hyper-parameters have been extensively and finely tuned to their respective tasks, as the goal of this contribution was to provide a generic, robust method. Thus it is likely that performance gains can still be obtained on this front.

B.1 SINGLE DOMAIN SUPERVISED DISENTANGLEMENT EXPERIMENTS

This section describes the network architecture and the hyper-parameters used in the experiments of Section 5.1. The encoder Π is composed of shared layers (layers 1 to 6 in the table below), followed by the specific task-related and style encodings. Those final layers are denoted Π_τ and Π_σ in the tables below. For the sake of implementation simplicity, we chose to project samples from the source domain and samples from the target domain into two separate style embeddings (one for each domain). Thus Π_σ is actually duplicated in two heads Π_σ,s and Π_σ,t with the same structure and output space. We used the exact same network architectures for both the 3D shapes and SVHN datasets, the only difference being the dimension of the embeddings T and S. In all experiments, we applied a coefficient β_reco = 5 to L_reco and β_cyclic = 0.1 to L_cyclic in the global loss. We also use a coefficient β_info on L_info; this coefficient increases linearly from 10⁻² to 10 over the first 10 epochs and remains at 10 afterwards (see Section C for a discussion on this coefficient). Convergence was reached within 50 epochs. We used Adam (Kingma & Ba, 2015) as an optimizer with a learning rate lr = 5·10⁻⁴ for the first 30 epochs and lr = 5·10⁻⁵ for the last 20 epochs. The following tables summarize the architectures of all sub-networks. For the GTA5→Cityscapes experiment, we based our network's architecture on the one proposed by Romera et al. (2017).

C INFLUENCE OF THE β COEFFICIENTS

While this gradual increase of the β coefficients is not absolutely necessary, we found it helped the overall convergence. These coefficients gradually increase the weight of the information disentanglement objective and the cross-domain reconstruction objective. This assigns more importance to learning a good predictor c ∘ Π during early stages. From this perspective, gradually increasing β_info can be seen as gradually removing task-useless information from T and transferring it to S.
Similarly, increasing β_domain cyclic corresponds to letting the network discover disentangled representations before aligning them across domains. As previously mentioned, our goal in this study was to provide a robust disentanglement method that permits domain adaptation. Therefore, no complete hyper-parameter study and tuning was performed; these findings are thus reported as such and might be incomplete. Refining the understanding of the influence of the different β coefficients is closer to the problem of meta-learning and is beyond the scope of this paper.
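The linear warm-up described above can be sketched as a simple schedule function (illustrative only; the function name and defaults are ours):

```python
def beta_schedule(epoch, start=1e-2, end=10.0, warmup_epochs=10):
    """Linearly ramp a loss coefficient from `start` to `end` over the first
    `warmup_epochs` epochs, then keep it constant. With these defaults it
    matches the beta_info schedule; the beta_domain_cyclic schedule is
    analogous with start=0."""
    if epoch >= warmup_epochs:
        return end
    return start + (end - start) * epoch / warmup_epochs
```

At each epoch, the returned value multiplies the corresponding loss term in the global objective.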

D INFLUENCE OF BATCH NORMALIZATION AND DROPOUT ON DICYR

Batch normalization (Ioffe & Szegedy, 2015) is an efficient way to reduce the discrepancy between the source and target distribution statistics. We noticed that, for the specific SVHN→MNIST setting, using instance normalization (Ulyanov et al., 2016) slightly improves the target domain accuracy. By normalizing across channels, the instance normalization layers help the network to be agnostic to the image contrast, which is particularly strong in MNIST. We also noticed that using a large dropout in the sub-network c, and small embedding dimensions for Π's outputs, improves the target domain accuracy.

G DOMAIN ADAPTATION FOR THE GTA5→CITYSCAPES SEGMENTATION TASK

Here we report on the application of DiCyR to the segmentation task between GTA5 (Richter et al., 2016) and Cityscapes (Cordts et al., 2016). GTA5 is the source domain, where the ground truth of image segments is provided, and the goal is to reach efficient segmentation on the Cityscapes dataset. Table 4 reports the Intersection over Union criterion (the Jaccard index) for each object class in the Cityscapes images. We compare the seminal approach of Hoffman et al. (2016), coined "FCNs in the wild", which served as a baseline for CyCADA (Hoffman et al., 2018), and the application of DiCyR. These results are preliminary² and did not benefit from any hyper-parameter or architecture tuning. The purpose of this table is to report the out-of-the-box performance of DiCyR and validate the rationale behind performing disentanglement on a challenging, large-scale problem.
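The normalization swap discussed in Section D can be sketched as follows (a hypothetical convolutional block, not the paper's exact layers):

```python
import torch.nn as nn

def conv_block(in_ch, out_ch, use_instance_norm=True):
    """Illustrative conv block: swapping BatchNorm2d for InstanceNorm2d
    normalizes each image independently across spatial positions, making the
    features less sensitive to per-image contrast (strong in MNIST)."""
    norm = nn.InstanceNorm2d(out_ch) if use_instance_norm else nn.BatchNorm2d(out_ch)
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
        norm,
        nn.ReLU(inplace=True),
    )
```

The change is local to the encoder's convolutional layers; the rest of the training procedure is unaffected.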



Code and pre-trained networks available at https://github.com/AnonymousDiCyR/DiCyR.
Comparisons might be inexact due to reproducibility concerns (Pineau et al., 2020); these figures mostly indicate which are the top competing methods.
² They were obtained following a very valuable suggestion by a reviewer.



Figure 1: Network architectures

Figure 2: Swapping styles on SVHN and 3D Shapes

Figure 3: Nearest neighbors according to each representation

Figure 4: Cross-domain swapping and feature alignment visualization

Figure 5: GTA5 to Cityscapes segmentation

Figure 6: Cross-domain swapping, USPS→MNIST

Figure 7: Cross-domain swapping, MNIST→USPS

Figure 5 provided a first visual illustration of the benefits of DiCyR. Figure 9 provides further examples. Column 9a presents the test image from the Cityscapes dataset, column 9b shows the output on this image of a classifier trained only on the GTA5 images, and column 9c shows the segmentation obtained by DiCyR, which can be compared to the ground truth (column 9d).

Figure 9: GTA5 to Cityscapes segmentation

Accuracies of a classifier trained to predict factors of style variation on 3D shapes



Table 4: Intersection over Union (IoU) criterion for each object class, GTA5→Cityscapes.

Class         | Source only | FCNs in the wild | CyCADA | DiCyR
road          |    68.6     |       70.4       |  85.2  |  69.2
sidewalk      |    11.8     |       32.4       |  37.2  |  15.5
building      |    57.4     |       62.1       |  76.5  |  68.2
wall          |     5.4     |       14.9       |  21.8  |  15.4
fence         |     2.5     |        5.4       |  15.0  |   9.2
pole          |    11.3     |       10.9       |  23.8  |  12.0
traffic light |     6.6     |       14.2       |  22.9  |   9.6
traffic sign  |     0.9     |        2.7       |  21.5  |   1.2
vegetation    |    65.5     |       79.2       |  80.5  |  70.1
terrain       |    12.9     |       21.3       |  31.3  |  32.1
sky           |    42.1     |       64.6       |  60.7  |  60.9
person        |    13.1     |       44.1       |  50.5  |  28.6
rider         |     1.7     |        4.2       |   9.0  |   0.3
car           |    41.9     |       70.4       |  76.9  |  59.6
truck         |     4.7     |        8.0       |  17.1  |   8.9
bus           |     3.8     |        7.3       |  28.2  |   8.7
train         |     2.8     |        0.0       |   4.5  |   2.0
motorcycle    |     1.8     |        3.5       |   9.8  |   4.4
bicycle       |     0.0     |        0.0       |   0.0  |   0.0
mean IoU      |    18.7     |       27.1       |  35.4  |  25.0
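For reference, the per-class Intersection over Union reported above can be computed as follows. This is a minimal NumPy sketch, not the evaluation code used in the experiments; the handling of classes absent from both maps is a convention that varies between benchmarks.

```python
import numpy as np

def per_class_iou(pred, target, num_classes):
    """Per-class Intersection over Union (Jaccard index) for segmentation maps.
    `pred` and `target` are integer label arrays of the same shape."""
    ious = []
    for k in range(num_classes):
        p, t = (pred == k), (target == k)
        union = np.logical_or(p, t).sum()
        # Classes absent from both prediction and ground truth get 0 here.
        ious.append(np.logical_and(p, t).sum() / union if union else 0.0)
    return ious
```

The mean IoU column is simply the average of the 19 per-class values.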

6. CONCLUSION

In this work, we introduced a new disentanglement method, called DiCyR, to separate task-related and task-orthogonal style information into different representations in the context of unsupervised domain adaptation. This method also provides a simple and efficient way to obtain disentangled representations for supervised learning problems. Its main features are its overall simplicity, the use of intra-domain and cross-domain cyclic reconstruction, and information separation through Gradient Reversal Layers. The design of this method stems from a formal definition of disentanglement for domain adaptation which, to the best of our knowledge, is new. The empirical evaluation shows that DiCyR performs as well as state-of-the-art methods, while offering the additional benefit of disentanglement, including in the target domains where no label information is available.

B.2 DOMAIN ADAPTATION EXPERIMENTS

This section describes the network architecture and the hyper-parameters used in the experiments of Section 5.2. The encoder Π is composed of shared layers (layers 1 to 9 in the tables below), followed by the specific task-related and style encodings. Those final layers are denoted Π_τ and Π_σ in the tables below. For the sake of implementation simplicity, we chose to project samples from the source domain and samples from the target domain into two separate style embeddings (one for each domain). Thus Π_σ is actually duplicated in two heads Π_σ,s and Π_σ,t with the same structure and output space. In all experiments, in the global loss, we applied a coefficient β_cyclic = 0.1 to L_cyclic and β_domain cyclic to L_domain cyclic, with β_domain cyclic increasing linearly from 0 to 10 during the first 10 epochs and remaining at 10 afterwards (see Section C for a discussion on this coefficient). Convergence was reached within 50 epochs (generally within 30 epochs). We used Adam (Kingma & Ba, 2015) as an optimizer with a learning rate lr = 5·10⁻⁴ for the first 30 epochs and lr = 5·10⁻⁵ for the last 20 epochs. The following tables summarize the architectures of all sub-networks.
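The shared-trunk/two-head encoder structure described above can be sketched as follows. All layer sizes and names here are illustrative placeholders, not the architectures listed in the tables.

```python
import torch.nn as nn

class Encoder(nn.Module):
    """Sketch of the encoder Pi: a shared trunk, a task head Pi_tau, and two
    per-domain style heads Pi_sigma_s / Pi_sigma_t with identical structure."""
    def __init__(self, task_dim=16, style_dim=8):
        super().__init__()
        self.trunk = nn.Sequential(
            nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Flatten(),
        )
        feat = 64 * 16 * 16  # flattened trunk output for 32x32 inputs
        self.pi_tau = nn.Linear(feat, task_dim)
        self.pi_sigma_s = nn.Linear(feat, style_dim)  # source style head
        self.pi_sigma_t = nn.Linear(feat, style_dim)  # target style head

    def forward(self, x, domain="source"):
        h = self.trunk(x)
        style = self.pi_sigma_s(h) if domain == "source" else self.pi_sigma_t(h)
        return self.pi_tau(h), style
```

Routing each sample through its own style head is what implements the "two separate style embeddings" simplification mentioned above.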

F DISTRIBUTION OF RESULTS

In Figure 8 we report the distribution of testing accuracies for the SVHN→MNIST benchmark of Section 5.2. Most of the accuracies are distributed within three modes with low variance, centered respectively around 89%, 94% and 98%, the latter being also the median and the majority mode. This sheds more detailed light on the results reported in Section 5.2. As discussed therein, we conjecture these modes stem from the local equilibria of the adversarial optimization problem induced by the GRLs in DiCyR. Thus we expect that avoiding this adversarial setting altogether in the formulation of DiCyR could improve the resulting distribution.
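For completeness, the Gradient Reversal Layer at the origin of this adversarial optimization can be sketched as follows (a minimal PyTorch implementation; DiCyR's actual code may differ):

```python
import torch

class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; multiplies the gradient by -lambda in
    the backward pass."""
    @staticmethod
    def forward(ctx, x, lambd):
        ctx.lambd = lambd
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        # Reverse (and scale) the gradient flowing back to the encoder.
        return -ctx.lambd * grad_output, None

def grad_reverse(x, lambd=1.0):
    return GradReverse.apply(x, lambd)
```

The branch placed after the GRL trains normally, while the encoder upstream receives reversed gradients, pushing it to remove the information that branch exploits; this min-max dynamic is what can settle into the local equilibria observed above.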

