INTEGRATING CATEGORICAL SEMANTICS INTO UNSUPERVISED DOMAIN TRANSLATION

Abstract

While unsupervised domain translation (UDT) has seen a lot of success recently, we argue that mediating its translation via categorical semantic features could broaden its applicability. In particular, we demonstrate that categorical semantics improves the translation between perceptually different domains sharing multiple object categories. We propose a method to learn, in an unsupervised manner, categorical semantic features (such as object labels) that are invariant across the source and target domains. We show that conditioning the style encoder of unsupervised domain translation methods on the learned categorical semantics leads to a translation that preserves the digits on MNIST↔SVHN and to a more realistic stylization on Sketches→Reals.

1. INTRODUCTION

Domain translation has sparked a lot of interest in the computer vision community following the work of Isola et al. (2016) on image-to-image translation. This was done by learning a conditional GAN (Mirza & Osindero, 2014), in a supervised manner, using paired samples from the source and target domains. CycleGAN (Zhu et al., 2017a) considered the task of unpaired and unsupervised image-to-image translation, showing that such a translation was possible by simply learning a mapping and its inverse under a cycle-consistency constraint, with a GAN loss for each domain. But, as has been noted, despite the cycle-consistency constraint, the proposed translation problem is fundamentally ill-posed and can consequently result in arbitrary mappings (Benaim et al., 2018; Galanti et al., 2018; de Bézenac et al., 2019). Nevertheless, CycleGAN and its derivatives have shown impressive empirical results on a variety of image translation tasks. Galanti et al. (2018) and de Bézenac et al. (2019) argue that CycleGAN's success is owed, for the most part, to architectural choices that induce implicit biases toward minimal-complexity mappings.

That being said, CycleGAN and follow-up works on unsupervised domain translation have commonly been applied to domains in which a translation entails few geometric changes and the style of the generated sample is independent of the semantic content of the source sample. Commonly showcased examples include translating edges↔shoes and horses↔zebras. While these approaches are not without applications, we demonstrate two situations where unsupervised domain translation methods are currently lacking. The first one, which we call Semantic-Preserving Unsupervised Domain Translation (SPUDT), is defined as translating, without supervision, between domains that share common semantic attributes.
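The ill-posedness argument above can be made concrete with a toy sketch. In the example below, the scalar maps `G`, `F`, `G2`, and `F2` are invented stand-ins for the neural translators in CycleGAN (they are not part of the paper's method): two different pairs of mappings achieve a cycle-consistency loss of exactly zero, so the constraint alone cannot pin down which translation is learned.

```python
# Toy 1-D illustration (hand-picked maps, not learned networks):
# CycleGAN trains G: X -> Y and F: Y -> X under the cycle-consistency loss
#   L_cyc = E_x |F(G(x)) - x| + E_y |G(F(y)) - y|.

def cycle_loss(G, F, xs, ys):
    # Average L1 reconstruction error in both cycle directions.
    lx = sum(abs(F(G(x)) - x) for x in xs) / len(xs)
    ly = sum(abs(G(F(y)) - y) for y in ys) / len(ys)
    return lx + ly

xs = [0.5, -1.0, 2.0]   # "source domain" samples (illustrative values)
ys = [1.5, -0.5, 3.0]   # "target domain" samples

G = lambda x: 2.0 * x   # one translator ...
F = lambda y: y / 2.0   # ... and its exact inverse
print(cycle_loss(G, F, xs, ys))    # 0.0: perfectly cycle-consistent

# Ill-posedness: a *different* pair is equally cycle-consistent.
G2 = lambda x: -2.0 * x  # flips the sign of every output
F2 = lambda y: -y / 2.0  # undoes the flip, so the cycle still closes
print(cycle_loss(G2, F2, xs, ys))  # also 0.0, yet G2(x) != G(x)
```

Both pairs are perfect under the cycle objective, which is why, as noted above, implicit architectural biases (rather than the loss itself) are credited with selecting a sensible mapping in practice.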
Such attributes may be a non-trivial composition of features obfuscated by domain-dependent spurious features, making it hard for current methods to translate the samples while preserving the shared semantics, despite the implicit bias. Translating MNIST↔SVHN is an example where the shared semantics, the digit identity, is obfuscated by many spurious features, such as colours and background distractors, in the SVHN domain. In section 4.1, we take this specific example and demonstrate that using domain-invariant categorical semantics improves digit preservation in UDT. The second situation that we consider is Style-Heterogeneous Domain Translation (SHDT). SHDT refers to a translation in which the target domain includes many semantic categories, with a distinct

