LEARNING DISENTANGLED REPRESENTATIONS FOR IMAGE TRANSLATION

Abstract

Recent approaches for unsupervised image translation rely heavily on generative adversarial training and architectural locality constraints. Despite their appealing results, the learned class and content representations remain entangled, which often hurts translation performance. To this end, we propose OverLORD, a method for learning disentangled representations of image class and attributes that leverages latent optimization and carefully designed content and style bottlenecks. We further argue that the commonly used adversarial optimization can be decoupled from representation disentanglement and applied at a later stage of training to increase the perceptual quality of the generated images. Based on these principles, our model learns significantly more disentangled representations and achieves higher translation quality and greater output diversity than state-of-the-art methods.

1. INTRODUCTION

Learning disentangled representations from a set of observations is a fundamental problem in machine learning. Such representations can facilitate generalization to downstream discriminative and generative tasks, as well as improve interpretability (Hsu et al., 2017), reasoning (van Steenkiste et al., 2019) and fairness (Creager et al., 2019). Recent advances have contributed to various tasks such as novel image synthesis (Zhu et al., 2018) and person re-identification (Eom & Ham, 2019).

Image translation is an extensively researched task that benefits from disentanglement. Its goal is to generate an analogous image in a target domain (e.g. cats) given an input image in a source domain (e.g. dogs). Although this task is generally under-specified, it is commonly made tractable by assuming that images in different domains share common attributes (e.g. head pose) which can be transferred during translation; we name these attributes content. In many cases, the class (domain) and common attributes do not uniquely specify the target image, e.g. there are many dog breeds with the same head pose. This multi-modality motivates specifying the particular class-specific attributes that we wish the target image to have; we name these style. The ability to transfer the content of a source image to a target class and style has been the objective of several methods, e.g. MUNIT (Huang et al., 2018), FUNIT (Liu et al., 2019) and StarGAN-v2 (Choi et al., 2020). Unfortunately, we show that despite their visually pleasing results, the translated images still retain many class-specific attributes of the original image, resulting in limited translation quality. For example, when translating dogs to wild animals, current methods are prone to transfer facial shapes which are unique to dogs and should not be carried over to wild animals. As demonstrated in Fig. 1, our model transfers the semantic head pose more reliably.
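To make the class/content/style distinction concrete, the following toy sketch (our own illustration with made-up attribute names, not the paper's model) treats an "image" as a record of its class, a content attribute shared across classes, and a class-specific style attribute; translation keeps only the content:

```python
# Hypothetical stand-in for the translation objective: an "image" is fully
# specified by its class, a content attribute (e.g. head pose) shared across
# classes, and a class-specific style attribute (e.g. fur texture).

def render(y, content, style):
    # A tuple of named fields plays the role of the image x.
    return (f"class={y}", f"pose={content}", f"style={style}")

def translate(source, target_class, target_style):
    _, pose, _ = source  # keep only the content of the source image
    return (f"class={target_class}", pose, f"style={target_style}")

dog = render("dog", "head-left", "beagle-fur")
wild_a = translate(dog, "wild", "tiger-stripes")
wild_b = translate(dog, "wild", "lion-mane")
```

Fixing the class and content while varying the style yields distinct outputs, which is exactly the multi-modal ambiguity described above.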
In this work, we analyze the class-supervised setting and present a principled objective for disentangling image class and attributes. We explain why LORD, introduced by Gabbay & Hoshen (2020), cannot be applied to multi-modal translation. We then show that introducing an additional style representation overcomes this issue and propose a practical method for high-fidelity image translation by learning disentangled representations. Our method operates in two stages: i) Disentanglement: disentangled representation learning in a non-adversarial framework, leveraging latent optimization and well-motivated content and style bottlenecks. ii) Synthesis: the disentangled representations learned in the previous stage are used to "supervise" a synthesis network that generalizes to unseen images and classes. As training the synthesis network is well-conditioned, we can effectively incorporate a GAN loss, resulting in a high-fidelity image translation model. Our approach illustrates that adversarial optimization, which is typically used for domain translation, is not necessary for disentanglement; its main utility lies in generating perceptually pleasing images. Our model learns disentangled representations and achieves better translation quality and output diversity than current methods.

Our contributions are: i) introducing a non-adversarial disentanglement method that enables multi-modal solutions; ii) learning statistically disentangled representations; iii) extending domain translation methods to cases with many (10k) domains; iv) state-of-the-art results in image translation.

Class-Supervised Disentanglement  In this parallel line of work, the goal is to anchor the semantics of all the images within each class into a separate class representation, while modeling all the remaining class-independent attributes by a content representation.
Several methods encourage disentanglement by adversarial constraints (Denton et al., 2017; Szabó et al., 2018; Mathieu et al., 2016), while others rely on cycle consistency (Harsh Jha et al., 2018) or group accumulation (Bouchacourt et al., 2018). LORD (Gabbay & Hoshen, 2020) takes a non-adversarial approach and trains a generative model while directly optimizing over class and content codes. Most works in this area demonstrate domain translation results on simple datasets, but not in multi-modal (many-to-many) settings. Moreover, their focus is on achieving disentanglement at the representation level rather than on designing architectures for high-resolution image translation, resulting in weak performance on competitive benchmarks. In this work, we draw inspiration from LORD in relying on the inductive bias conferred by latent optimization to learn a disentangled content representation. In contrast to LORD, we tackle the multi-modal image translation setting by modeling style. Moreover, we add an adversarial term in the synthesis network that increases image quality and resolution.
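As a minimal illustration of the inductive bias conferred by latent optimization (a 1-D toy of our own devising, not LORD's or OverLORD's actual architecture), the sketch below optimizes one code per class and one code per image directly by gradient descent on a reconstruction loss; an L2 bottleneck on the per-image code pushes class-shared information into the class code:

```python
import random

# Toy latent optimization: each observation is x = class_offset + content,
# and we recover per-class codes and per-image content codes by descending
# directly on the codes themselves (no encoder). The L2 penalty on content
# (lam) is a stand-in for the content bottleneck.
random.seed(0)

n_classes, n_per_class = 3, 20
true_offset = [5.0, -2.0, 9.0]  # ground-truth class-specific component
data = [(y, true_offset[y] + random.gauss(0, 1.0))
        for y in range(n_classes) for _ in range(n_per_class)]

class_code = [0.0] * n_classes   # one optimized code per class
content = [0.0] * len(data)      # one optimized code per image
lr, lam = 0.05, 0.01             # lam: content bottleneck strength

for _ in range(2000):
    for i, (y, x) in enumerate(data):
        err = class_code[y] + content[i] - x   # gradient of 0.5 * err**2
        class_code[y] -= lr * err
        content[i] -= lr * (err + lam * content[i])
```

At convergence each class code approaches the mean of its class, while the per-image content codes capture only the residual, class-independent variation.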

2. BACKGROUND: REPRESENTATION LEARNING IN IMAGE-TRANSLATION

Image translation takes as input a set of N images and corresponding class labels (x_1, y_1), (x_2, y_2), ..., (x_N, y_N). Let us assume that an image x_i is fully specified by its class y_i and attributes a_i. As a motivating example, let us consider the images x_i to be of animals, where the class label y_i denotes the species. The attributes a_i may include attributes a^c_i common to all classes



Figure 1: Examples of entanglement in state-of-the-art image translation models. Current approaches and their architectural biases tightly preserve the original structure and generate unreliable translations. Our model disentangles the high-level content and captures the target style faithfully.
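The second (synthesis) stage outlined in the introduction can be sketched in the same toy spirit: a hypothetical 1-D example in which the content codes recovered by latent optimization become regression targets that "supervise" a feed-forward encoder, so unseen images are embedded in a single pass (the real model is a convolutional network trained with reconstruction and adversarial losses, not a linear fit):

```python
import random

# Simulate stage-one outputs: each training image x = class_offset + z,
# paired with the "optimized" content code z found by latent optimization.
random.seed(1)

offsets = {0: 4.0, 1: -3.0}  # class-specific component per class
train = {y: [] for y in offsets}
for y, off in offsets.items():
    for _ in range(50):
        z = random.gauss(0, 1.0)
        train[y].append((off + z, z))  # (image x, target code z)

# Stage two: fit a per-class linear "encoder" z ~ w*x + b by closed-form
# least squares, supervised by the stage-one codes.
encoder = {}
for y, pairs in train.items():
    xs = [x for x, _ in pairs]
    zs = [z for _, z in pairs]
    mx, mz = sum(xs) / len(xs), sum(zs) / len(zs)
    w = (sum((x - mx) * (z - mz) for x, z in pairs)
         / sum((x - mx) ** 2 for x in xs))
    encoder[y] = (w, mz - w * mx)

def encode(x, y):
    w, b = encoder[y]
    return w * x + b
```

Because the relationship between image and code is deterministic in this toy, the fitted encoder recovers the content code of unseen samples exactly; the point is only that the optimized codes, once found, can be distilled into a fast feed-forward mapping.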

RELATED WORK

Image Translation  Translating the content of images across different domains has attracted much attention. In the unsupervised setting, CycleGAN (Zhu et al., 2017) introduces a cycle-consistency loss to encourage translated images to preserve the domain-invariant attributes (e.g. pose) of the source image. MUNIT (Huang et al., 2018) recognized that a given content image can be transferred to many different styles (e.g. colors and textures) in a target domain, and extends UNIT (Huang & Belongie, 2017) to learn multi-modal mappings by learning style representations. DRIT (Lee et al., 2018) tackles the same setting using an adversarial constraint at the representation level. MSGAN (Mao et al., 2019) adds a regularization term to prevent mode collapse. StarGAN-v2 (Choi et al., 2020) and DMIT (Yu et al., 2019) extend previous frameworks to translation across more than two domains. FUNIT (Liu et al., 2019) further allows translation to novel domains.

