MINIMAL GEOMETRY-DISTORTION CONSTRAINT FOR UNSUPERVISED IMAGE-TO-IMAGE TRANSLATION

Abstract

Unsupervised image-to-image (I2I) translation, which aims to learn a domain mapping function without paired data, is very challenging because the function is highly under-constrained. Despite significant progress in constraining the mapping function, current methods suffer from the geometry distortion problem: the geometric structure of the translated image is inconsistent with that of the input source image, which may cause undesired distortions in the translated images. To remedy this issue, we propose a novel I2I translation constraint, called the Minimal Geometry-Distortion Constraint (MGC), which promotes the consistency of geometric structures and reduces unwanted distortions by reducing the randomness of the color transformation in the translation process. To facilitate the estimation and maximization of MGC, we propose an approximate representation of mutual information, called relative Squared-Loss Mutual Information (rSMI), that can be estimated analytically and efficiently. We demonstrate the effectiveness of MGC through quantitative and qualitative comparisons with state-of-the-art methods on several benchmark datasets.

1. INTRODUCTION

Image-to-image translation, or domain mapping, aims to translate an image in the source domain X to the target domain Y. It has been extensively studied (Pathak et al., 2016; Isola et al., 2017; Liu et al., 2019) and has been applied to various vision tasks (Sela et al., 2017; Siddiquee et al., 2019; Ghosh et al., 2019; Tomei et al., 2019; Wu et al., 2019). Early works considered supervised image-to-image (I2I) translation, where paired samples {(x_i, y_i)}_{i=1}^{N} drawn from the joint distribution P_XY are available. In the presence of paired data, methods based on conditional generative adversarial networks can generate high-quality translations (Isola et al., 2017; Wang et al., 2018; Pathak et al., 2016). However, since paired data are often unavailable or expensive to obtain, unsupervised I2I translation has attracted intense attention in recent years (Zhu et al., 2017; Yi et al., 2017; Kim et al., 2017; Benaim & Wolf, 2017; Huang et al., 2018; Lee et al., 2019; Kim et al., 2019; Park et al., 2020). Benefiting from generative adversarial networks (GANs) (Goodfellow et al., 2014), one can perform unsupervised I2I translation by finding a mapping G_XY such that the translated images and the target-domain images have similar distributions, i.e., P_{G_XY(X)} ≈ P_Y. Because an infinite number of functions can satisfy the adversarial loss, a GAN alone cannot guarantee learning the true mapping function, resulting in sub-optimal translation performance. To remedy this issue, various kinds of constraints have been placed on the learned mapping function. For instance, the well-known cycle consistency (Zhu et al., 2017; Kim et al., 2017; Yi et al., 2017) enforces the translation function G_XY to be bijective. DistanceGAN (Benaim & Wolf, 2017) preserves the pairwise distances of the source images. GcGAN (Fu et al., 2019) forces the function to be smooth w.r.t. certain geometric transformations of the input images.
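To make the cycle-consistency constraint concrete: it requires the two learned mappings to be mutual inverses, i.e., G_YX(G_XY(x)) ≈ x. The sketch below is only an illustration of that loss, with toy invertible linear maps standing in for the neural translators (an assumption for demonstration, not the paper's models):

```python
import numpy as np

# Toy stand-ins for the two translators. In CycleGAN these would be neural
# networks; here an invertible linear map and its inverse are used purely
# for illustration.
A = np.array([[2.0, 0.5], [0.0, 1.0]])
G_XY = lambda x: x @ A.T                    # source -> target
G_YX = lambda y: y @ np.linalg.inv(A).T     # target -> source

def cycle_loss(x, g_xy, g_yx):
    """L1 cycle-consistency loss: mean |G_YX(G_XY(x)) - x|."""
    return np.abs(g_yx(g_xy(x)) - x).mean()

x = np.random.default_rng(0).normal(size=(100, 2))
print(cycle_loss(x, G_XY, G_YX))            # near zero: the maps invert each other
print(cycle_loss(x, G_XY, lambda y: y))     # clearly positive: identity does not invert G_XY
```

A pair of mappings that satisfies the constraint drives this loss toward zero, which is how cycle consistency restricts G_XY to (approximately) bijective functions.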
DRIT++ (Lee et al., 2019) and MUNIT (Huang et al., 2018) learn disentangled representations by embedding images into a domain-invariant content space and a domain-specific attribute space, so that the mapping function can be derived from the representation learning components. However, the mapping functions learned by current methods are still far from satisfactory in real applications. Here we consider a simple but widely applicable image translation task, namely the geometry-invariant translation task. In this task, the geometric structure (e.g., the shapes of objects) of images in the source and target domains is invariant, and the variation of photometric information within a given geometric region is expected to conform to the change of style information; for example, the colour of a leaf is green in summer and white in winter. Existing methods applied to geometry-invariant translation tasks still suffer from the geometry distortion problem, where the geometric structures in the source and translated images are not consistent, resulting in a mismatch between input and translated images. A representative example is that the mapping function G_XY learned with cycle consistency (Zhu et al., 2017) often changes the geometric structures of digits in the SVHN → MNIST translation task, so some digits in source-domain images are accidentally translated into other digits. The geometry distortion problem hinders the application of unsupervised geometry-invariant translation methods to a wide range of computer vision applications, such as domain adaptation (Hoffman et al., 2017), segmentation (Zhu et al., 2017) and style transfer (Huang et al., 2018).

In this paper, we propose a new constraint for unsupervised geometry-invariant image translation, called the minimal geometry-distortion constraint (MGC), as a general I2I translation constraint that guarantees the consistency of the geometric structures of the source and translated images and thus reduces translation mismatch. We observe that the pixel values before and after translation are usually highly correlated when the geometric structure is preserved, because the color transformation is more regular within specific object regions. Taking the color transformation of a leaf as an example, the transformation of a green leaf into a red leaf contains less randomness than its transformation into a colorful one. Based on this observation, we propose a mutual information (MI)-based dependency measure that models the nonlinear relationships between pixel values in the source and translated images. To estimate MI from data, we propose the relative Squared-Loss Mutual Information (rSMI), which can be estimated efficiently in an analytic form.

By maximizing rSMI together with the GAN loss, our approach significantly reduces geometry distortion by better preserving geometric structures. In the experiments, we incorporate the minimal geometry-distortion constraint into the GAN framework and show its effectiveness in preserving geometric structures both when used independently and when combined with existing constraints (e.g., cycle consistency), demonstrating its compatibility. Quantitative and qualitative comparisons with baselines (models without MGC) and state-of-the-art methods on several datasets demonstrate the superiority of the proposed MGC constraint.

2. RELATED WORK

Unsupervised Image-to-Image Translation. In unsupervised image-to-image (I2I) translation, unaligned examples drawn individually from the marginal distributions of the source and target domains are available. Although the subject has made promising progress in recent years, only a few works study it from an optimization perspective. Cycle-consistency-based GANs, e.g., CycleGAN (Zhu et al., 2017), DualGAN (Yi et al., 2017) and DiscoGAN (Kim et al., 2017), are a general approach to this problem. DistanceGAN (Benaim & Wolf, 2017) and GcGAN (Fu et al., 2019) further introduced distance and geometry transformation consistency to

Figure 1: Illustration of how random color transformation causes the geometry distortion problem in unsupervised image translation. Images in the first column are input images; images in the second column are the images translated by CycleGAN and by GAN+MGC. Visually, as the color transformation of the corresponding regions between the input and translated images shows, the color of the human face is translated into several colors at random by CycleGAN, leading to a distortion of the face shape. In contrast, the color transformation in GAN+MGC is consistent and thus preserves the shape of the human face. To reveal the randomness of the color transformation quantitatively, the third-column images show the non-linear dependencies between pixel values in the input image and the corresponding pixel values in the translated image. Evidently, the geometry-preserving translation (by GAN+MGC) has a stronger color dependency than the geometry-distorted one (by CycleGAN).
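The exact analytic form of rSMI is not given in this excerpt. As background for the dependency measurement that Figure 1 visualizes, plain squared-loss mutual information can be estimated in closed form by least-squares density-ratio fitting (the LSMI estimator of Suzuki and Sugiyama); the sketch below is a minimal NumPy version under assumed Gaussian kernels and a fixed regularizer, not the paper's estimator. It shows that a strongly dependent pair of variables scores higher than an independent pair, mirroring the stronger pixel-value dependency of geometry-preserving translations:

```python
import numpy as np

def gauss_kernel(a, centers, sigma):
    # a: (n, d), centers: (b, d) -> (n, b) Gaussian kernel matrix
    d2 = ((a[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma ** 2))

def lsmi(x, y, n_basis=100, sigma=1.0, lam=0.1, seed=0):
    """Analytic least-squares estimate of squared-loss mutual information.

    Fits the density ratio p(x, y) / (p(x) p(y)) with the kernel basis
    phi_l(x, y) = K(x, u_l) * L(y, v_l) and returns (1/2) h^T alpha - 1/2.
    Kernel width and regularizer are fixed here for simplicity.
    """
    n = len(x)
    rng = np.random.default_rng(seed)
    idx = rng.choice(n, size=min(n_basis, n), replace=False)
    K = gauss_kernel(x, x[idx], sigma)        # (n, b)
    L = gauss_kernel(y, y[idx], sigma)        # (n, b)
    H = (K.T @ K) * (L.T @ L) / n ** 2        # H_{ll'} = E_x E_y [phi_l phi_l']
    h = (K * L).mean(axis=0)                  # h_l = E_{xy} [phi_l]
    alpha = np.linalg.solve(H + lam * np.eye(len(idx)), h)
    return 0.5 * h @ alpha - 0.5

rng = np.random.default_rng(1)
x = rng.normal(size=(300, 1))
y_dep = x + 0.1 * rng.normal(size=(300, 1))   # strongly dependent pair
y_ind = rng.normal(size=(300, 1))             # independent pair
print(lsmi(x, y_dep), lsmi(x, y_ind))         # dependent pair scores higher
```

In an I2I setting, x and y would be corresponding pixel values of the source and translated images, so a higher score indicates a more regular (less random) color transformation.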

