LEARNING DISENTANGLED REPRESENTATIONS FOR IMAGE TRANSLATION

Abstract

Recent approaches for unsupervised image translation rely heavily on generative adversarial training and architectural locality constraints. Despite their appealing results, the learned class and content representations remain entangled, which often hurts translation performance. To this end, we propose OverLORD, a method for learning disentangled representations of image class and attributes that utilizes latent optimization and carefully designed content and style bottlenecks. We further argue that the commonly used adversarial optimization can be decoupled from representation disentanglement and applied at a later stage of training to increase the perceptual quality of the generated images. Based on these principles, our model learns significantly more disentangled representations and achieves higher translation quality and greater output diversity than state-of-the-art methods.

1. INTRODUCTION

Learning disentangled representations from a set of observations is a fundamental problem in machine learning. Such representations can facilitate generalization to downstream discriminative and generative tasks, and can improve interpretability (Hsu et al., 2017), reasoning (van Steenkiste et al., 2019) and fairness (Creager et al., 2019). Recent advances have contributed to various tasks such as novel image synthesis (Zhu et al., 2018) and person re-identification (Eom & Ham, 2019).

Image translation is an extensively researched task that benefits from disentanglement. Its goal is to generate an analogous image in a target domain (e.g. cats) given an input image in a source domain (e.g. dogs). Although this task is generally ill-posed, it becomes feasible under the assumption that images in different domains share common attributes (e.g. head pose) which can be transferred during translation; we name these the content. In many cases, the class (domain) and common attributes do not uniquely specify the target image, e.g. there are many dog breeds with the same head pose. This multi-modality motivates the specification of the particular class-specific attributes that we wish the target image to have; we name these the style.

The ability to transfer the content of a source image to a target class and style has been the objective of several methods, e.g. MUNIT (Huang et al., 2018), FUNIT (Liu et al., 2019) and StarGAN-v2 (Choi et al., 2020). Unfortunately, we show that despite their visually pleasing results, the translated images still retain many class-specific attributes of the original image, resulting in limited translation quality. For example, when translating dogs to wild animals, current methods are prone to transfer facial shapes which are unique to dogs and should not be transferred precisely to wild animals. As demonstrated in Fig. 1, our model transfers the semantic head pose more reliably.
In this work, we analyze the class-supervised setting and present a principled objective for disentangling image class and attributes. We explain why LORD (Gabbay & Hoshen, 2020) cannot be applied to multi-modal translation. We then show that introducing an additional style representation overcomes this issue, and propose a practical method for high-fidelity image translation by learning disentangled representations. Our method achieves this in two stages: i) Disentanglement: disentangled representation learning in a non-adversarial framework, leveraging latent optimization and well-motivated content and style bottlenecks. ii) Synthesis: the disentangled representations learned in the previous stage are used to "supervise" a synthesis network that generalizes to unseen images and classes. As synthesis network training is well-conditioned, we can effectively incorporate a GAN loss, resulting in a high-fidelity image translation model. Our approach illustrates that adversarial optimization, which is typically used for domain translation, can be decoupled from representation disentanglement and applied at a later stage of training.
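The first, non-adversarial stage can be illustrated with a minimal latent-optimization sketch. The toy example below is not the authors' implementation; the linear decoder, all dimensions, and all variable names are illustrative assumptions. It optimizes a free content code per image and a shared embedding per class jointly with the decoder by gradient descent on a reconstruction loss; additive noise on the content codes plays the role of the information bottleneck that discourages class information from leaking into them.

```python
import numpy as np

rng = np.random.default_rng(0)

n_imgs, n_classes = 8, 2
d_content, d_class, d_img = 4, 3, 6

# Toy "images": a class-dependent mean plus small per-image variation.
class_means = rng.normal(size=(n_classes, d_img))
labels = np.arange(n_imgs) % n_classes
x = class_means[labels] + 0.1 * rng.normal(size=(n_imgs, d_img))

# Latent-optimized parameters: one content code per image, one embedding
# per class, and a linear decoder W (a stand-in for a deep generator).
c = 0.1 * rng.normal(size=(n_imgs, d_content))
e = 0.1 * rng.normal(size=(n_classes, d_class))
W = 0.1 * rng.normal(size=(d_img, d_content + d_class))

lr, noise_std = 0.1, 0.1
for step in range(2000):
    # Noisy content bottleneck: noise is added only to the content codes.
    z = np.concatenate(
        [c + noise_std * rng.normal(size=c.shape), e[labels]], axis=1)
    x_hat = z @ W.T
    err = x_hat - x
    # Manual gradients of the mean squared reconstruction error.
    gW = err.T @ z / n_imgs
    gz = err @ W / n_imgs
    gc = gz[:, :d_content]
    ge = np.zeros_like(e)
    np.add.at(ge, labels, gz[:, d_content:])  # accumulate per-class grads
    W -= lr * gW
    c -= lr * gc
    e -= lr * ge

# Evaluate reconstruction without noise after training.
loss = np.mean((np.concatenate([c, e[labels]], axis=1) @ W.T - x) ** 2)
print(f"reconstruction MSE: {loss:.4f}")
```

In the full method, the decoder is a deep network, a low-dimensional per-image style code is learned alongside the content code, and the trained codes then "supervise" the second-stage synthesis network to which the GAN loss is applied.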

