INTEGRATING CATEGORICAL SEMANTICS INTO UNSUPERVISED DOMAIN TRANSLATION

Abstract

While unsupervised domain translation (UDT) has seen a lot of success recently, we argue that mediating its translation via categorical semantic features could broaden its applicability. In particular, we demonstrate that categorical semantics improves the translation between perceptually different domains sharing multiple object categories. We propose a method to learn, in an unsupervised manner, categorical semantic features (such as object labels) that are invariant across the source and target domains. We show that conditioning the style encoder of unsupervised domain translation methods on the learned categorical semantics leads to a translation preserving the digits on MNIST↔SVHN and to a more realistic stylization on Sketches→Reals.

1. INTRODUCTION

Domain translation has sparked a lot of interest in the computer vision community following the work of Isola et al. (2016) on image-to-image translation. Their approach learned a conditional GAN (Mirza & Osindero, 2014), in a supervised manner, using paired samples from the source and target domains. CycleGAN (Zhu et al., 2017a) considered the task of unpaired and unsupervised image-to-image translation, showing that such a translation was possible by simply learning a mapping and its inverse under a cycle-consistency constraint, with GAN losses for each domain. But, as has been noted, despite the cycle-consistency constraint, the proposed translation problem is fundamentally ill-posed and can consequently result in arbitrary mappings (Benaim et al., 2018; Galanti et al., 2018; de Bézenac et al., 2019). Nevertheless, CycleGAN and its derivatives have shown impressive empirical results on a variety of image translation tasks. Galanti et al. (2018) and de Bézenac et al. (2019) argue that CycleGAN's success is owed, for the most part, to architectural choices that induce implicit biases toward minimal-complexity mappings. That being said, CycleGAN, and follow-up works on unsupervised domain translation, have commonly been applied to domains in which a translation entails little geometric change and the style of the generated sample is independent of the semantic content of the source sample. Commonly showcased examples include translating edges↔shoes and horses↔zebras.

While these approaches are not without applications, we demonstrate two situations where unsupervised domain translation methods are currently lacking. The first one, which we call Semantic-Preserving Unsupervised Domain Translation (SPUDT), is defined as translating, without supervision, between domains that share common semantic attributes.
Such attributes may be a non-trivial composition of features obfuscated by domain-dependent spurious features, making it hard for current methods to translate the samples while preserving the shared semantics despite the implicit bias. Translating between MNIST↔SVHN is an example of translation where the shared semantics, the digit identity, is obfuscated by many spurious features, such as colours and background distractors, in the SVHN domain. In Section 4.1, we take this specific example and demonstrate that using domain-invariant categorical semantics improves digit preservation in UDT.

The second situation that we consider is Style-Heterogeneous Domain Translation (SHDT). SHDT refers to a translation in which the target domain includes many semantic categories, with a distinct style per semantic category. We demonstrate that, in this situation, the style encoder must be conditioned on the shared semantics to generate a style consistent with the semantics of the given source image. In Section 4.2, we consider an example of this problem where we translate an ensemble of sketches, with different objects among them, to real images.

In this paper, we explore both the SPUDT and SHDT settings. In particular, we demonstrate how domain-invariant categorical semantics can improve translation in these settings. Existing works (Hoffman et al., 2018; Bousmalis et al., 2017) have considered semi-supervised variants by training a classifier with labels on the source domain. But, differently from them, we show that it is possible to perform well at both kinds of tasks without any supervision, simply with access to unlabelled samples from the two domains. This additional constraint may further enable applications of domain translation in situations where labelled data is absent or scarce.

* Correspondence to: samuel.lavoie.m@gmail.com. † CIFAR fellow.
1 The public code can be found at: https://github.com/lavoiems/Cats-UDT.
To tackle these problems, we propose a method which we refer to as Categorical Semantics Unsupervised Domain Translation (CatS-UDT). CatS-UDT consists of two steps: (1) learning an inference model of the shared categorical semantics across the domains of interest without supervision and (2) using a domain translation model in which we condition the style generation by inferring the learned semantics of the source sample using the model learned in the previous step. We depict the first step in Figure 1b and the second in Figure 2. More specifically, the contributions of this work are the following:

• A novel framework for learning invariant categorical semantics across domains (Section 3.1).
• The introduction of a method of semantic style modulation to make SHDT generations more consistent (Section 3.2).
• A comparison with UDT baselines on SPUDT and SHDT, highlighting their existing challenges and demonstrating the relevance of incorporating semantics into UDT (Section 4).
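The two steps above can be sketched schematically. In the toy NumPy sketch below, `infer_semantics` and `style_encoder` are hypothetical stand-ins (random linear maps, not the learned networks of this paper), and `N_CLASSES` and `STYLE_DIM` are illustrative constants; the point is only the data flow, i.e. that the style code is computed from the image together with the inferred categorical semantics.

```python
import numpy as np

rng = np.random.default_rng(0)

N_CLASSES = 10   # hypothetical number of shared semantic categories
STYLE_DIM = 8    # hypothetical style-code dimensionality

def infer_semantics(x):
    """Step (1) stand-in: a frozen inference model mapping an image batch to a
    distribution over shared categorical semantics (a random linear probe
    here; the actual model is learned without supervision)."""
    w = rng.standard_normal((x.shape[1], N_CLASSES))
    logits = x @ w
    e = np.exp(logits - logits.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)   # softmax: rows sum to 1

def style_encoder(x, semantics):
    """Step (2) stand-in: the style code depends on the image *and* the
    inferred semantics, so each category can carry a distinct style."""
    w_x = rng.standard_normal((x.shape[1], STYLE_DIM))
    w_s = rng.standard_normal((N_CLASSES, STYLE_DIM))
    return np.tanh(x @ w_x + semantics @ w_s)

x = rng.random((4, 64))    # a batch of 4 flattened toy "images"
s = infer_semantics(x)     # domain-invariant categorical semantics
z = style_encoder(x, s)    # style codes conditioned on the semantics
print(s.shape, z.shape)    # (4, 10) (4, 8)
```

Without the `semantics` argument, the style code would be a function of the image alone, which is the failure mode in SHDT discussed above.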

2. RELATED WORKS

Domain translation is concerned with translating samples from a source domain to a target domain. In general, we categorize a translation that uses pairing or supervision through labels as supervised domain translation and a translation that does not use pairing or labels as unsupervised domain translation.

Supervised domain translation methods have generally achieved success through either the use of pairing or the use of supervised labels. Methods that leverage category labels include Taigman et al. (2017); Hoffman et al. (2018); Bousmalis et al. (2017). The differences between these approaches lie in particular architectural choices and auxiliary objectives for training the translation network. Alternatively, Isola et al. (2016); Gonzalez-Garcia et al. (2018); Wang et al. (2018; 2019); Zhang et al. (2020) leverage paired samples as a signal to guide the translation. Also, some works propose to leverage a segmentation mask (Tomei et al., 2019; Roy et al., 2019; Mo et al., 2019). Another strategy is to use the representation of a pre-trained network as semantic information (Ma et al., 2019; Wang et al., 2019; Wu et al., 2019; Zhang et al., 2020). Such a representation typically comes from an intermediate layer of a VGG (Liu & Deng, 2015) network pre-trained on labelled ImageNet (Deng et al., 2009). Conversely to our work, Murez et al. (2018) propose to use image-to-image translation to regularize domain adaptation.

Unsupervised domain translation considers the task of domain translation without any supervision, whether through labels or pairing of images across domains. CycleGAN (Zhu et al., 2017a) proposed to learn a mapping and its inverse constrained with a cycle-consistency loss. The authors demonstrated that CycleGAN works surprisingly well for some translation problems. Later works have improved this class of models (Liu et al., 2017; Kim et al., 2017; Almahairi et al., 2018; Huang et al., 2018; Choi et al., 2017; 2019; Press et al., 2019), enabling multi-modal and more diverse generations. But, as shown in Galanti et al. (2018), the success of these methods is mostly due to architectural constraints and regularizers that implicitly bias the translation toward mappings with minimum complexity.

We recognize the usefulness of this inductive bias for preserving low-level features like the pose of the source image. This observation motivates the method proposed in Section 3.2 for conditioning the style using the semantics.
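As a concrete reminder of the constraint discussed above, the following is a minimal sketch of CycleGAN's L1 cycle-consistency objective. The mappings `G` and `F` here are toy scalar functions standing in for the two translation networks, and the per-domain GAN losses are omitted.

```python
import numpy as np

def cycle_consistency_loss(G, F, x, y):
    """L1 cycle-consistency: F(G(x)) should reconstruct x, and G(F(y))
    should reconstruct y. During training this is combined with a GAN
    loss for each domain (omitted in this sketch)."""
    forward = np.mean(np.abs(F(G(x)) - x))
    backward = np.mean(np.abs(G(F(y)) - y))
    return forward + backward

# Toy stand-ins for the two translators: scaling by 2 and by 0.5 are
# exact inverses, so the cycle loss vanishes.
G = lambda x: 2.0 * x   # source -> target
F = lambda y: 0.5 * y   # target -> source

x = np.random.rand(4, 16)
y = np.random.rand(4, 16)
print(cycle_consistency_loss(G, F, x, y))  # 0.0 for exact inverses
```

Note that the loss is zero for *any* pair of mutually inverse mappings, which is precisely why cycle-consistency alone leaves the translation problem ill-posed: the implicit architectural bias, not the loss, selects the mapping.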

