TRIP: REFINING IMAGE-TO-IMAGE TRANSLATION VIA RIVAL PREFERENCES

Abstract

We propose a new model to refine image-to-image translation via an adversarial ranking process. In particular, we simultaneously train two modules: a generator that translates an input image into the desired image with smooth, subtle changes with respect to specific attributes; and a ranker that ranks rival preferences consisting of the input image and the desired image. Rival preferences refer to the adversarial ranking process: (1) the ranker regards the desired image as no different from the input image in terms of the desired attributes; (2) the generator fools the ranker into believing that the desired image changes the attributes over the input image as desired. Preferences over pairs of real images are introduced to guide the ranker to rank image pairs with respect to the attributes of interest only. With an effective ranker, the generator "wins" the adversarial game by producing high-quality images that present the desired changes in the attributes compared to the input image. Experiments demonstrate that our TRIP generates high-fidelity images which exhibit smooth changes with the strength of the attributes.

1. INTRODUCTION

Image-to-image (I2I) translation (Isola et al., 2017) aims to translate an input image into a desired output with changes in some specific attributes. The current literature can be classified into two categories: binary translation (Zhu et al., 2017; Kim et al., 2017), e.g., translating an image from "not smiling" to "smiling"; and fine-grained translation (Lample et al., 2017; He et al., 2019; Liu et al., 2018; Saquil et al., 2018), e.g., generating a series of images with smooth changes from "not smiling" to "smiling". In this work, we focus on high-quality fine-grained I2I translation, namely, generating a series of realistic versions of the input image with smooth changes in the specific attributes (see Fig. 1). Note that "high-quality" in our context is twofold: first, the generated images look as realistic as the training images; second, the generated images are modified only in terms of the specific attributes.

A relative attribute (RA), referring to the preference between two images over the strength of the attribute of interest, is widely used in the fine-grained I2I translation task due to its rich semantic information. Previous work, the Ranking Conditional Generative Adversarial Network (RCGAN) (Saquil et al., 2018), adopts two separate criteria for high-quality fine-grained translation. Specifically, a ranker distills the discrepancy from RAs regarding the targeted attribute, which then guides the generator to translate the input image into the desired one; meanwhile, a discriminator ensures the generated images are as realistic as the training images. However, the fine-grained images guided by the ranker fall outside the real data distribution, which conflicts with the goal of the discriminator. As a result, the generated images cannot maintain smooth changes and suffer from low quality. RelGAN (Wu et al., 2019) instead applied a unified discriminator for high-quality fine-grained translation.
The discriminator guides the generator to learn the distribution of triplets, each consisting of a pair of images and the corresponding numerical label (i.e., the relative attribute). Further, RelGAN adopted fine-grained RAs within the same framework to enable smooth interpolation. However, joint data-distribution matching does not explicitly model the discrepancy from the RAs and fails to capture sufficient semantic information, so the generated images fail to change smoothly over the attribute of interest.

In this paper, we propose a new adversarial ranking framework consisting of a ranker and a generator for high-quality fine-grained translation. In particular, the ranker explicitly learns to model the discrepancy from the relative attributes, which guides the generator to produce the desired image from the input image. Meanwhile, a rival preference consisting of the generated image and the input image is constructed to evoke adversarial training between the ranker and the generator. Specifically, the ranker cannot differentiate the strength of the attribute of interest between the generated image and the input image, while the generator aims to obtain the ranker's agreement that the generated image holds the desired difference compared to the input. The competition between the ranker and the generator drives both modules to improve until the generated images exhibit the desired preferences while possessing high fidelity.

We summarize our contributions as follows:

• We propose Translation via RIval Preference (TRIP), consisting of a ranker and a generator, for high-quality fine-grained translation. The rival preference is constructed to evoke adversarial training between the ranker and the generator, which enhances the ability of the ranker and encourages a better generator.
• Our tailor-designed ranker enforces a continuous change between the generated image and the input image, which promotes better fine-grained control over the attribute of interest.
• Empirical results show that our TRIP achieves state-of-the-art results on the fine-grained image-to-image translation task. Meanwhile, the input image can be manipulated linearly along the strength of the attribute.
• We further extend TRIP to fine-grained I2I translation over multiple attributes. A case study demonstrates the efficacy of TRIP in disentangling multiple attributes and manipulating them simultaneously.
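To make the rival-preference mechanism concrete, the following is a minimal sketch of how opposing training targets can be assigned on the same (input, generated) pair. The function names and the squared-loss surrogate are our own illustrative assumptions; the paper's exact loss formulation is not given in this section.

```python
# Toy sketch of rival-preference targets (hypothetical surrogate losses).
# A ranker r(x, y) scores how much stronger the attribute is in y than in x:
#   +1: y has more of the attribute, -1: less, 0: no difference.

def ranker_target(pair_kind, real_label=None):
    """Training target for the ranker on one image pair.

    pair_kind: "real"  -> two real images with relative-attribute
                          label real_label in {-1, 0, +1};
               "rival" -> (input, generated): the ranker is trained
                          to see *no* attribute difference.
    """
    if pair_kind == "real":
        return real_label   # learn the true preference from RAs
    if pair_kind == "rival":
        return 0            # adversarial target: "no difference"
    raise ValueError(pair_kind)

def generator_target(desired_direction):
    """The generator wants the ranker to report the desired change."""
    assert desired_direction in (-1, +1)
    return desired_direction

def squared_loss(score, target):
    """A simple least-squares surrogate for the ranking objective."""
    return (score - target) ** 2

# Suppose the ranker currently scores an (input, generated) pair at 0.6.
score = 0.6
# The ranker update pushes this score towards 0 ("no difference") ...
ranker_loss = squared_loss(score, ranker_target("rival"))
# ... while the generator update pushes it towards the desired +1 change.
gen_loss = squared_loss(score, generator_target(+1))
print(round(ranker_loss, 2), round(gen_loss, 2))
```

The two losses pull the ranker's score on the same pair in opposite directions, which is exactly the competition described above: the equilibrium is reached only when the generated image genuinely exhibits the desired attribute change.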

2. RELATED WORKS

We mainly review the literature related to fine-grained I2I translation, especially smooth facial attribute transfer, and summarize it based on the type of generative model used.

AE/VAE-based methods can provide a good latent representation of the input image. Some works (Lample et al., 2017; Liu et al., 2018; Li et al., 2020; Ding et al., 2020) proposed to disentangle the attribute-dependent latent variable from the image representation, resorting to different disentanglement strategies. The fine-grained translation can then be derived by smoothly manipulating the attribute variable of the input image. However, the reconstruction loss, which is used to ensure image quality, cannot guarantee high fidelity of the hallucinated images.

Flow-based. Some works (Kondo et al., 2019) incorporate a feature disentanglement mechanism into flow-based generative models. However, the designed multi-scale disentanglement requires heavy computation, and the reported results do not show satisfactory performance on smooth control.

GAN-based. GAN is a widely adopted framework for high-quality image generation. Various methods build on GANs for fine-grained I2I translation through relative attributes; the main differences lie in how the preference over the attributes is incorporated into the image generation process. Saquil et al. (2018) adopted two critics: a ranker, learning from the relative attributes, and a discriminator, ensuring image quality. The combination of the two critics is supposed to guide the generator to produce high-quality fine-grained images. However, the ranker induces the generator to generate out-of-distribution images, which opposes the target of the discriminator, thereby resulting in poor-quality images. Wu et al. (2019) applied a unified discriminator, which learns the joint data distribution of triplets constructed from a pair of images and a discrete numerical label (i.e., the relative attribute). However, such joint distribution modeling only captures the discrete discrepancy of the RAs and fails to generalize well to continuous labels. Rather than using RAs, He et al. (2019) directly modeled the attribute

Figure 1: Fine-grained Image-to-image translation on the "smile" attribute (generated by our TRIP).

