LEARNING INPUT-AGNOSTIC MANIPULATION DIRECTIONS IN STYLEGAN WITH TEXT GUIDANCE

Abstract

With the advantages of fast inference and human-friendly, flexible manipulation, image-agnostic style manipulation via text guidance enables new applications that were not previously available. The state-of-the-art text-guided image-agnostic manipulation method embeds the representation of each channel of StyleGAN independently in the Contrastive Language-Image Pre-training (CLIP) space and provides it in the form of a dictionary, so that channel-wise manipulation directions can be quickly retrieved at inference time. In this paper, however, we argue that such a dictionary, constructed by controlling each channel individually, is too limited to accommodate the versatility of text guidance, since the collective and interactive relations among multiple channels are not considered. Indeed, we show that it fails to discover a large portion of the manipulation directions that can be found by existing methods, which manipulate the latent space manually, without texts. To alleviate this issue, we propose a novel method, Multi2One, that learns a dictionary, each entry of which corresponds to the representation of a single channel, by taking into account the manipulation effect arising from the interaction with multiple other channels. We demonstrate that our strategy resolves the inability of previous methods to find diverse known directions from unsupervised methods and unknown directions from random text, while maintaining real-time inference speed and disentanglement ability.

1. INTRODUCTION

A wide range of generative models, including generative adversarial networks (Goodfellow et al., 2014; Karras et al., 2018; 2019; 2020b; Kim et al., 2022; Kim & Ha, 2021; Karras et al., 2021), diffusion models (Dhariwal & Nichol, 2021), and auto-regressive models (Dosovitskiy et al., 2020; Chang et al., 2022), have demonstrated a notable ability to generate high-resolution images that are hardly distinguishable from real ones. Among these powerful models, style-based GANs (Karras et al., 2019; 2020b) are equipped with a unique latent space that enables style and content mixing of given images, manipulation of local regions (Wu et al., 2021), and interpolation between different classes of images (Sauer et al., 2022). In this paper, we focus on image manipulation based on a pre-trained StyleGAN, considering the unique advantages mentioned above and its popularity. Based on the steerability of the latent space of StyleGAN, researchers have put tremendous effort into finding a direction that causes a semantically equivalent change across all image samples. In this work, we refer to such a latent direction as a global direction. Unlike a local direction, which is a sample-wise traversal direction found by iterative optimization on a single image (Local Basis (Choi et al., 2021) and the latent optimization of StyleCLIP (Patashnik et al., 2021)), a global direction allows fast inference and is applicable to any image once found, using supervised (Jahanian et al., 2019), unsupervised (Shen & Zhou, 2021; Wang & Ponce, 2021; Härkönen et al., 2020; Voynov & Babenko, 2020), or text-guided methods (the Global Mapper and GlobalDirection* of StyleCLIP (Patashnik et al., 2021)).
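To make the notion of a global direction concrete, the following minimal sketch applies one shared direction to every latent code. The NumPy arrays stand in for StyleGAN latents; the shapes and the `apply_global_direction` helper are illustrative assumptions, not part of any released implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins for StyleGAN latents: 4 samples in a 512-dim latent space.
# In the real pipeline these would come from the mapping network.
latents = rng.normal(size=(4, 512))

# A single global direction (unit vector) shared by all samples.
direction = rng.normal(size=512)
direction /= np.linalg.norm(direction)

def apply_global_direction(w, d, alpha):
    """Shift every latent code by the same direction, scaled by alpha."""
    return w + alpha * d[None, :]

edited = apply_global_direction(latents, direction, alpha=3.0)
# The same offset is applied to every sample, i.e. the edit is input-agnostic.
print(np.allclose(edited - latents, 3.0 * direction[None, :]))  # True
```

This input-agnostic property is what allows a global direction, once found, to be reused on arbitrary images without per-sample optimization.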

[Figure 1 image grid: manipulation examples for the prompts 'Smirking man', 'Man with coy smile', 'Smiling man', 'Grinning man', and 'Evil Queen'; rows labeled Source and Manip.; columns compare Original, GD, and Ours.]
(see Fig. 1(a) for examples). In addition, we show that this standard method does not properly perform manipulation for a large number of randomly selected texts (see Fig. 1(b) for examples). We hypothesize that the failure is due to the naïve approach of considering only the change of image caused by a single channel in StyleSpace, neglecting diverse directions that are visible only when multiple channels are manipulated as a whole. To address these issues, we propose a novel method, named Multi2One, that learns a dictionary capable of manipulating multiple channels corresponding to a given text. Since there is no ground-truth pairing of texts and their corresponding manipulation directions, we embed the directions found by existing unsupervised methods into the CLIP space and learn a dictionary that reproduces them there. Note that this has more meaning than simply reproducing the known directions derived by unsupervised methods: as the dictionary learns the relationship between channels in StyleSpace and the CLIP space, we can find manipulations that could not be found with unsupervised methods by using diverse text inputs. Through extensive experiments, we confirm that, in contrast to the state-of-the-art method (Patashnik et al., 2021), which explicitly encodes every single channel, our multi-channel strategy excels not only in the reconstruction of unsupervised directions but also in the discovery of text-guided directions.
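As a rough illustration of how a channel-wise dictionary can be queried with a text embedding at inference time, the sketch below scores each StyleSpace channel against a CLIP text embedding and keeps only the top-k channels. All names, shapes, and the top-k selection rule are hypothetical simplifications for exposition, not the actual StyleCLIP or Multi2One implementation:

```python
import numpy as np

rng = np.random.default_rng(1)

num_channels, clip_dim = 6048, 512  # rough StyleSpace / CLIP sizes (assumed)
# Hypothetical dictionary: one unit-norm CLIP-space entry per StyleSpace channel.
dictionary = rng.normal(size=(num_channels, clip_dim))
dictionary /= np.linalg.norm(dictionary, axis=1, keepdims=True)

def direction_from_text(text_emb, dictionary, top_k=50):
    """Score each channel by cosine similarity with the text embedding,
    then keep only the top-k most relevant channels, zeroing the rest."""
    t = text_emb / np.linalg.norm(text_emb)
    scores = dictionary @ t                    # (num_channels,)
    direction = np.zeros_like(scores)
    top = np.argsort(np.abs(scores))[-top_k:]  # indices of strongest channels
    direction[top] = scores[top]
    return direction

text_emb = rng.normal(size=clip_dim)  # stands in for a CLIP text embedding
d = direction_from_text(text_emb, dictionary)
print(int((d != 0).sum()))  # 50
```

Because the dictionary lookup is a single matrix-vector product followed by a sort, inference remains real-time regardless of the text prompt.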

2. RELATED WORK

Style-based Generators Generators of style-based models (Karras et al., 2019; 2020b; a; 2021) are built upon the progressive structure (Karras et al., 2018) that generates images of higher resolution in deeper blocks. The popularity of the StyleGAN architecture, which has been employed in numerous studies, comes from its ability to generate high-fidelity images, transfer styles to other images, and manipulate images in its latent spaces using inversion methods (Zhu et al., 2020; Roich et al., 2021; Tov et al., 2021; Collins et al., 2020).



* To distinguish it from a global direction, which denotes an input-agnostic direction in general, we refer to the specific method proposed in StyleCLIP as GlobalDirection.



Figure 1: (a) Manipulation by the 70-th direction from GANspace generates 'a man with wide smile'. GlobalDirection (GD), highlighted in red, fails to reproduce a similar result even when provided with various text guidances. (b) Manipulation results for randomly selected texts, demonstrating that GD has insufficient manipulation ability. The same number of channels is manipulated in both methods.

The latent spaces of StyleGAN used for manipulation are the intermediate space W and StyleSpace S (Wu et al., 2021).

Unsupervised Global Directions Image-agnostic directions are latent vectors that create a semantically equivalent shift when applied to the latent space of StyleGANs. To find such directions, SeFa (Shen & Zhou, 2021) performs PCA on the first weight matrix that follows the intermediate space W in a pre-trained StyleGAN, taking the principal components as the global directions. On the other hand, GANspace (Härkönen et al., 2020) relies on randomly sampled latent codes in W and
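The GANspace-style procedure described above (PCA on sampled latent codes) can be sketched as follows; the data here is synthetic and the helper name is our own, so this is an illustration of the idea rather than the authors' code:

```python
import numpy as np

rng = np.random.default_rng(2)

# Synthetic stand-in for latent codes sampled from the intermediate space W.
# In GANspace these would be w = f(z) for many randomly drawn z.
W_samples = rng.normal(size=(2048, 512))

def pca_directions(W, n_components=10):
    """Principal components of sampled latent codes, used as candidate
    global (input-agnostic) manipulation directions."""
    W_centered = W - W.mean(axis=0, keepdims=True)
    # SVD of the centered data matrix: the rows of Vt are the principal axes.
    _, _, Vt = np.linalg.svd(W_centered, full_matrices=False)
    return Vt[:n_components]

directions = pca_directions(W_samples)
print(directions.shape)  # (10, 512)
```

Each returned row is a unit-norm direction orthogonal to the others; applying any one of them with a scalar strength to all latent codes yields the kind of input-agnostic edit discussed above.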

