MINIMAL GEOMETRY-DISTORTION CONSTRAINT FOR UNSUPERVISED IMAGE-TO-IMAGE TRANSLATION

Abstract

Unsupervised image-to-image (I2I) translation, which aims to learn a domain mapping function without paired data, is very challenging because the mapping function is highly under-constrained. Despite significant progress in constraining the mapping function, current methods suffer from the geometry distortion problem: the geometric structure of the translated image is inconsistent with that of the input source image, causing undesired distortions in the translated images. To remedy this issue, we propose a novel I2I translation constraint, called the Minimal Geometry-Distortion Constraint (MGC), which promotes the consistency of geometric structure and reduces unwanted distortions by reducing the randomness of the color transformation in the translation process. To facilitate the estimation and maximization of MGC, we propose an approximate representation of mutual information called relative Squared-loss Mutual Information (rSMI), which can be estimated efficiently in analytic form. We demonstrate the effectiveness of MGC through quantitative and qualitative comparisons with state-of-the-art methods on several benchmark datasets.

1. INTRODUCTION

Image-to-image translation, or domain mapping, aims to translate an image from a source domain X to a target domain Y. It has been extensively studied (Pathak et al., 2016; Isola et al., 2017; Liu et al., 2019) and applied to various vision tasks (Sela et al., 2017; Siddiquee et al., 2019; Ghosh et al., 2019; Tomei et al., 2019; Wu et al., 2019). Early works considered supervised image-to-image (I2I) translation, where paired samples $\{(x_i, y_i)\}_{i=1}^{N}$ drawn from the joint distribution $P_{XY}$ are available. In the presence of paired data, methods based on conditional generative adversarial networks can generate high-quality translations (Isola et al., 2017; Wang et al., 2018; Pathak et al., 2016). However, since paired data are often unavailable or expensive to obtain, unsupervised I2I translation has attracted intense attention in recent years (Zhu et al., 2017; Yi et al., 2017; Kim et al., 2017; Benaim & Wolf, 2017; Huang et al., 2018; Lee et al., 2019; Kim et al., 2019; Park et al., 2020). Benefiting from generative adversarial networks (GANs) (Goodfellow et al., 2014), one can perform unsupervised I2I translation by finding a mapping $G_{XY}$ such that the translated images and the target-domain images have similar distributions, i.e., $P_{G_{XY}(X)} \approx P_Y$. Because infinitely many functions can satisfy the adversarial objective, a GAN alone cannot guarantee that the true mapping function is learned, resulting in sub-optimal translation performance. To remedy this issue, various constraints have been placed on the learned mapping function. For instance, the well-known cycle-consistency (Zhu et al., 2017; Kim et al., 2017; Yi et al., 2017) enforces the translation function $G_{XY}$ to be bijective. DistanceGAN (Benaim & Wolf, 2017) preserves the pairwise distances between source images. GcGAN (Fu et al., 2019) forces the function to be consistent w.r.t. certain geometric transformations of the input images.
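For concreteness, the two standard objectives referenced above — the adversarial loss and the cycle-consistency constraint of Zhu et al. (2017) — can be written as follows (a standard formulation from the cited work, not a contribution of this paper; $D_Y$ denotes the discriminator on domain Y, and $G_{YX}$ the reverse mapping):

```latex
% Adversarial loss: pushes the translated distribution P_{G_XY(X)} toward P_Y
\mathcal{L}_{\mathrm{GAN}}(G_{XY}, D_Y) =
  \mathbb{E}_{y \sim P_Y}\left[\log D_Y(y)\right]
  + \mathbb{E}_{x \sim P_X}\left[\log\bigl(1 - D_Y(G_{XY}(x))\bigr)\right]

% Cycle-consistency loss: constrains G_XY to be (approximately) bijective
\mathcal{L}_{\mathrm{cyc}}(G_{XY}, G_{YX}) =
  \mathbb{E}_{x \sim P_X}\left[\lVert G_{YX}(G_{XY}(x)) - x \rVert_1\right]
  + \mathbb{E}_{y \sim P_Y}\left[\lVert G_{XY}(G_{YX}(y)) - y \rVert_1\right]
```

Note that $\mathcal{L}_{\mathrm{cyc}}$ constrains only the composition $G_{YX} \circ G_{XY}$, which is why additional structural constraints, such as the geometric ones discussed next, remain necessary.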
DRIT++ (Lee et al., 2019) and MUNIT (Huang et al., 2018) learn disentangled representations by embedding images into a domain-invariant content space and a domain-specific attribute space, from which the mapping function can be derived. However, the mapping functions learned by current methods are still far from satisfactory in real applications. Here we consider a simple but widely applicable translation setting, the geometry-invariant translation task. In this task, the geometric structure (e.g., the shapes of objects) of images is invariant across the source and target domains, and only the photometric information of each geometric region is expected to change with the domain style; for example, a leaf that is green in summer turns white in winter. Existing methods enforced in geometry-invariant translation

