ENJOY YOUR EDITING: CONTROLLABLE GANS FOR IMAGE EDITING VIA LATENT SPACE NAVIGATION

Abstract

Controllable semantic image editing enables a user to change entire image attributes with a few clicks, e.g., gradually making a summer scene look like it was taken in winter. Classic approaches for this task use a Generative Adversarial Net (GAN) to learn a latent space and suitable latent-space transformations. However, current approaches often suffer from attribute edits that are entangled, global image identity changes, and diminished photo-realism. To address these concerns, we learn multiple attribute transformations simultaneously, integrate attribute regression into the training of transformation functions, and apply a content loss and an adversarial loss that encourage the maintenance of image identity and photo-realism. Unlike prior work, which primarily focuses on qualitative evaluation, we propose quantitative evaluation strategies for measuring controllable editing performance. Our model permits better control for both single- and multiple-attribute editing while preserving image identity and realism during transformation. We provide empirical results for both natural and synthetic images, highlighting that our model achieves state-of-the-art performance for targeted image manipulation.

1. INTRODUCTION

Semantic image editing is the task of transforming a source image to a target image while modifying desired semantic attributes, e.g., to make an image taken during summer look like it was captured in winter. The ability to semantically edit images is useful for various real-world tasks, including artistic visualization, design, photo enhancement, and targeted data augmentation. To this end, semantic image editing has two primary goals: (i) providing continuous manipulation of multiple attributes simultaneously and (ii) maintaining the original image's identity as much as possible while ensuring photo-realism. Existing GAN-based approaches for semantic image editing can be categorized roughly into two groups: (i) image-space editing methods directly transform one image to another across domains (Choi et al., 2018; 2020; Isola et al., 2017; Lee et al., 2020; Wu et al., 2019; Zhu et al., 2017a;b), usually using variants of generative adversarial nets (GANs) (Goodfellow et al., 2014). These approaches often have high computational cost, and they primarily focus on binary attribute (on/off) changes rather than providing continuous attribute editing abilities. (ii) latent-space editing methods focus on discovering latent variable manipulations that permit continuous semantic image edits. The chosen latent space is most often the latent space of GANs. Both unsupervised and (self-)supervised latent-space editing methods have been proposed. Unsupervised latent-space editing methods (Härkönen et al., 2020; Voynov & Babenko, 2020) are often less effective at providing semantically meaningful directions and all too often change image identity during an edit. Current (self-)supervised methods (Jahanian et al., 2019; Plumerault et al., 2020) are limited to geometric edits such as rotation and scale. To our knowledge, only one supervised approach has been proposed (Shen et al., 2019), developed to discover semantic latent-space directions for binary attributes.
As we show, this method suffers from entangled attributes and often does not preserve image identity during manipulation.

Contributions. We propose a latent-space editing framework for semantic image manipulation that fulfills the aforementioned primary goals. Specifically, we use a GAN and employ a joint sampling strategy trained to edit multiple attributes simultaneously. To disentangle attribute transformations in the latent space of GANs, we integrate a regressor to predict the attributes that an image exhibits. The regressor also permits precise control of the manipulation degree and is easily extended to multiple attributes simultaneously. In addition, we incorporate a perceptual loss (Li et al., 2019) and an adversarial loss that help preserve image identity and photo-realism during manipulation. We compare our method to several popular frameworks, from existing image-to-image translation methods (Choi et al., 2020; Wu et al., 2019; Zhu et al., 2017a) to latent-space transformation-based approaches (Shen et al., 2019; Voynov & Babenko, 2020). Note that prior work primarily uses qualitative evaluation, like the one in Fig. 1. In contrast, we propose a quantitative evaluation to measure controllability. Both qualitative and quantitative results provide evidence that our approach outperforms prior work in terms of quality of semantic image manipulation while maintaining image identity.
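The contributions above can be sketched as a latent-space edit: one learned direction per attribute, a user-controlled strength per attribute, and an attribute regressor that supervises how far the edit actually moved each attribute. The following numpy sketch illustrates this idea only; the variable names, the linear form of the transformation, and the linear stand-in regressor are our assumptions for illustration, not the paper's actual deep networks:

```python
import numpy as np

rng = np.random.default_rng(0)
d_latent, n_attr = 8, 3

# One learned edit direction per attribute (random stand-ins here).
directions = rng.normal(size=(n_attr, d_latent))

def transform(z, eps):
    """Move latent z along each attribute direction, scaled by the
    user-chosen edit strengths eps (one scalar per attribute)."""
    return z + eps @ directions

# Toy linear attribute regressor R(z); a deep net in the actual method.
W = rng.normal(size=(d_latent, n_attr))
def regress(z):
    return z @ W

z = rng.normal(size=(1, d_latent))
eps = np.array([[0.5, 0.0, -0.3]])  # push attribute 0 up, attribute 2 down
z_edit = transform(z, eps)

# Regression loss: the predicted attribute change should match eps,
# which both disentangles directions and calibrates edit strength.
reg_loss = np.mean((regress(z_edit) - regress(z) - eps) ** 2)
# Identity-preservation surrogate: keep the edited code near the original
# (the paper uses perceptual and adversarial losses on the decoded images).
content_loss = np.mean((z_edit - z) ** 2)
```

A zero-strength edit (`eps = 0`) leaves the latent code unchanged, which is the property that makes the manipulation degree continuously controllable.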

2. RELATED WORK

Generative Adversarial Networks (GANs) (Goodfellow et al., 2014) have significantly improved realistic image generation in recent years (Brock et al., 2018; Jolicoeur-Martineau, 2019; Karras et al., 2017; 2019a;b; Park et al., 2019; Zhang et al., 2018). For this, a GAN formulates a 2-player non-cooperative game between two deep nets: (i) a generator that produces an image given a random noise vector in the latent space, sampled from a known prior distribution, usually a normal or a uniform distribution; (ii) a discriminator whose input is both synthetic and real data, which it must differentiate. Semantic image editing seeks to automate image manipulation of semantics. Encouraged by the success of deep nets, recent works have applied deep learning methods to semantic image editing tasks such as style transfer (Li et al., 2017; Luan et al., 2017), image-to-image translation (Choi et al., 2018; 2020; Isola et al., 2017; Lee et al., 2020; Wang et al., 2018; Wu et al., 2019; Zhu et al., 2017a), and discovering semantically meaningful directions in a GAN latent space (Härkönen et al., 2020; Jahanian et al., 2019; Plumerault et al., 2020; Shen et al., 2019; Voynov & Babenko, 2020). Note that our task is an extended version of semantic image editing that requires more comprehensive control to satisfy user-desired operations. Therefore, most aforementioned approaches do not meet
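Formally, the two-player game between generator $G$ and discriminator $D$ corresponds to the minimax objective of Goodfellow et al. (2014), with data distribution $p_{\text{data}}$ and latent prior $p(z)$:

```latex
\min_{G} \max_{D} \;
\mathbb{E}_{x \sim p_{\text{data}}} \left[ \log D(x) \right]
+ \mathbb{E}_{z \sim p(z)} \left[ \log \left( 1 - D(G(z)) \right) \right]
```

At the game's equilibrium, the generator's distribution matches $p_{\text{data}}$, which is what makes the learned latent space a useful substrate for semantic edits.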



Figure 1: Real image manipulation on scene (top two rows, photo from Flickr) and face (bottom two rows, unseen image from CelebA-HQ) using pretrained StyleGAN2 (Karras et al., 2019b): We reconstruct the real images (col. 1) by finding a latent vector with the best inversion result (col. 2) on StyleGAN2 (Abdal et al., 2019). After that, we transform the latent vectors for single- and multiple-attribute manipulations (col. 3-6). Note that unlike ours, the baseline method (Shen et al., 2019) either changes image identity, confounds semantic properties, or both.

