LEARNING INPUT-AGNOSTIC MANIPULATION DIRECTIONS IN STYLEGAN WITH TEXT GUIDANCE

Abstract

With the advantages of fast inference and human-friendly, flexible manipulation, image-agnostic style manipulation via text guidance enables new applications that were not previously available. The state-of-the-art text-guided image-agnostic manipulation method embeds the representation of each channel of StyleGAN independently in the Contrastive Language-Image Pre-training (CLIP) space and provides it in the form of a dictionary, so that the channel-wise manipulation direction can be found quickly at inference time. However, in this paper we argue that this dictionary, which is constructed by controlling each channel individually, is too limited to accommodate the versatility of text guidance, since the collective and interactive relations among multiple channels are not considered. Indeed, we show that it fails to discover a large portion of the manipulation directions that can be found by existing methods that manually manipulate the latent space without text. To alleviate this issue, we propose a novel method, Multi2One, that learns a dictionary whose entries each correspond to the representation of a single channel, while taking into account the manipulation effect that arises from the interaction with multiple other channels. We demonstrate that our strategy resolves the inability of previous methods to find diverse known directions from unsupervised methods and unknown directions from random text, while maintaining real-time inference speed and disentanglement ability. ⋯

1. INTRODUCTION

A wide range of generative models, including adversarial networks (Goodfellow et al., 2014; Karras et al., 2018; 2019; 2020b; Kim et al., 2022; Kim & Ha, 2021; Karras et al., 2021), diffusion models (Dhariwal & Nichol, 2021), and auto-regressive models (Dosovitskiy et al., 2020; Chang et al., 2022), have demonstrated a notable ability to generate high-resolution images that are hardly distinguishable from real images. Among these powerful models, style-based GAN models (Karras et al., 2019; 2020b) are equipped with a unique latent space which enables style and content mixing of given images, manipulation of local regions (Wu et al., 2021), and interpolation between different classes of images (Sauer et al., 2022). In this paper, we focus on image manipulation based on the pre-trained StyleGAN, considering the unique advantages mentioned above and its popularity. Based on the steerability of the latent space of StyleGAN, researchers have put tremendous effort into finding a direction that causes a semantically equivalent change across all image samples. In this work, we refer to such a latent direction as a global direction. Unlike a local direction, which is a sample-wise traversal direction found by iterative optimization using a single image (Local Basis (Choi et al., 2021) and Latent Optimization of StyleCLIP (Patashnik et al., 2021)), a global direction allows fast inference and is applicable to any image once found using supervised (Jahanian et al., 2019), unsupervised (Shen & Zhou, 2021; Wang & Ponce, 2021; Härkönen et al., 2020; Voynov & Babenko, 2020), or text-guided methods (Global Mapper & GlobalDirection of StyleCLIP (Patashnik et al., 2021)).

[Figure 1(a) panel labels: Source, Manip., GD, Ours; driving texts: "Smiling man", "Grinning man", "Smirking man", "Man with coy smile"]

Among them, the text-guided methods have a unique advantage in that they can naturally provide flexibility of manipulation through the diversity of the given driving text, without human supervision to discover the direction in the latent space. However, in this paper we argue that, contrary to this common belief about text guidance, the standard method (Patashnik et al., 2021) for text-based StyleGAN manipulation surprisingly fails to find even the manipulation directions that are known to be found by unsupervised approaches (Härkönen et al., 2020; Shen & Zhou, 2021) (see Fig. 1(a) for examples). In addition, we also show that this standard method does not properly perform manipulation on a large number of randomly selected texts (see Fig. 1(b) for examples). We hypothesize that the failure is due to the naïve approach that only considers the change of image caused by a single channel in StyleSpace, neglecting diverse directions that are visible only when manipulating multiple channels as a whole.

[Figure 1 panel labels: Original, Source, GD, Ours; driving texts: "Attractive Male", "Innocent Eyes", "Evil Queen", "Grumpy"]

In order to address these issues, we propose a novel method, named Multi2One, of learning a dictionary that can manipulate multiple channels corresponding to a given text. However, since there is no paired ground truth of text and the manipulation direction corresponding to that text, we embed the directions found by existing unsupervised methods into the CLIP space and learn a dictionary to reproduce them in the CLIP space. Note that this has more meaning than simply reproducing the known directions derived by unsupervised methods. As the dictionary learns the relationship between channels in StyleSpace and CLIP space, we can find manipulations that could not be found with unsupervised methods using diverse text inputs. Through extensive experiments, we confirm that, in contrast to the state-of-the-art method (Patashnik et al., 2021), which explicitly encodes every single channel, our multi-channel based strategy excels not only in the reconstruction of unsupervised directions but also in the discovery of text-guided directions.

2. RELATED WORK

Style-based Generators Generators of style-based models (Karras et al., 2019; 2020b; a; 2021) are built upon the progressive structure (Karras et al., 2018) that generates images of higher resolution in deeper blocks. The popularity of the StyleGAN structure, which has been employed in numerous studies, comes from its ability to generate high-fidelity images, transfer styles to other images, and manipulate images in the latent spaces using inversion methods (Zhu et al., 2020; Roich et al., 2021; Tov et al., 2021; Collins et al., 2020). The latent spaces of StyleGAN used for manipulation are the intermediate space W and StyleSpace S (Wu et al., 2021).

Unsupervised Global Directions Image-agnostic directions are latent vectors that create a semantically equivalent shift when applied to the latent space of StyleGAN. In order to find such directions, SeFa (Shen & Zhou, 2021) performs PCA on the first weight that comes after the intermediate space W in the pre-trained StyleGAN, taking the principal components as the global directions.
On the other hand, GANspace (Härkönen et al., 2020) relies on randomly sampled latent codes in W, and the eigenvectors from the latent codes proved to be global directions that share an image-agnostic modification ability.

Figure 2: A diagram depicting the framework of dictionary-based image manipulation via text guidance. Our method, Multi2One, is differentiated from the previous methods in that it learns the dictionary for the text input. The proposed novel dictionary allows more flexible and expansive discovery of the manipulation direction $\hat{s}$ with better results. $\phi(\cdot)$ in $D_{\mathrm{GlobalDirection}}$ is an abbreviation of $\phi_{\mathrm{CLIP}}(\cdot)$ of Eq. (1). The dictionary learning process of Multi2One to create $D_{\mathrm{Multi2One}}$ is illustrated in Sec. 4.

Text-guided Image Manipulations Most of the text-guided image manipulation methods aim to find a local direction (Kocasari et al., 2022; Xia et al., 2021; Patashnik et al., 2021), which is an image-specific direction that applies to a single image sample. Methods that find local directions using text guidance can be found in various models including GANs, diffusion models (Kim & Ye, 2021; Nichol et al., 2021; Avrahami et al., 2022; Liu et al., 2021), and vision transformers (Chang et al., 2022). Two unique approaches for finding a global direction using text guidance in StyleGAN are Global Mapper and GlobalDirection of StyleCLIP (Patashnik et al., 2021). Global Mapper finds an image-invariant direction for a single text by optimizing a fully connected layer. However, this method requires 10 hours of training time for every single text, making it less popular than GlobalDirection. On the other hand, the GlobalDirection method offers real-time manipulation at inference time using a dictionary-based framework that is applicable to any input image and text.
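As a rough illustration of the unsupervised route mentioned above (GANspace-style PCA over sampled latent codes in W), the following minimal sketch may help. It is not the GANspace implementation: `mapping_network` is a hypothetical toy stand-in for the StyleGAN mapping network, and all sizes are arbitrary assumptions chosen so the example runs quickly.

```python
import numpy as np

rng = np.random.default_rng(0)
z_dim, w_dim, n_samples = 16, 16, 2000  # toy sizes (assumption, not StyleGAN's)

# Hypothetical stand-in for the StyleGAN mapping network Z -> W.
M = rng.normal(size=(w_dim, z_dim))

def mapping_network(z):
    """Toy nonlinear map from noise z of shape (N, z_dim) to codes w of shape (N, w_dim)."""
    return np.tanh(z @ M.T)

# 1) Sample latent codes in W by pushing Gaussian noise through the mapping.
w = mapping_network(rng.normal(size=(n_samples, z_dim)))

# 2) PCA via SVD of the centered codes; rows of `directions` are the
#    principal axes, ordered by decreasing explained variance.
w_centered = w - w.mean(axis=0)
_, singular_values, directions = np.linalg.svd(w_centered, full_matrices=False)

def apply_direction(w0, k, strength):
    """Shift a latent code along the k-th principal direction (same shift for any w0)."""
    return w0 + strength * directions[k]

edited = apply_direction(w[0], k=0, strength=2.0)
```

Because the same vector `directions[k]` is added regardless of the starting code, the edit is image-agnostic in the sense used in this section.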

3. LIMITED COVERAGE OF STYLECLIP GLOBALDIRECTION METHOD

In this section, we briefly review the GlobalDirection method of StyleCLIP (Patashnik et al., 2021), which performs StyleGAN-based image manipulation using text guidance (Sec. 3.1). Then, we present our key motivation: this state-of-the-art text-guided manipulation method is surprisingly insufficient to fully utilize the manipulative power of StyleGAN (Sec. 3.2).

StyleSpace of StyleGAN

The generators of the StyleGAN family (Karras et al., 2019; 2020a; b) have a number of latent spaces: Z, W, W+, and S. The original latent space is Z, which is typically the standard normal distribution. The generator transforms an input noise $z \sim \mathcal{N}(0, I)$ into the intermediate latent spaces W (Karras et al., 2019), W+ (Abdal et al., 2019), and S (Wu et al., 2021), sequentially. A recent study (Wu et al., 2021) shows that StyleSpace S is the most disentangled, such that it can change a distinct visual attribute in a localized way. The number of style channels in S is 6048, excluding toRGB channels, for StyleGAN-ADA (resolution $1024^2$), and recent methods (Kocasari et al., 2022; Patashnik et al., 2021; Wu et al., 2021) modify the values of the channels to edit an image. Our method also adopts StyleSpace, and we use n to denote the total number of channels in S (that is, $s = [s_1, s_2, \dots, s_n]^T \in S$, where $s_i$ is a single parameter). For convenience of explanation, the pre-trained generator G is re-defined with respect to StyleSpace as $X = G_s(s)$, where X is a generated image. The goal of StyleGAN-based image manipulation via text is to find a direction $\hat{s} = [\hat{s}_1, \hat{s}_2, \dots, \hat{s}_n]^T$ in StyleSpace S which generates an image $X_{\mathrm{edited}} = G_s(s + \hat{s})$ suitable for the provided text guidance t. Note that s is the inverted style vector of image X, found via StyleGAN inversion methods such as Alaluf et al. (2021); Roich et al. (2021); Tov et al. (2021), which is used for manipulation in most cases. The main benefit of considering StyleSpace S in image manipulation is that, owing to its well-disentangled property, modifying a small number of style channels does not change undesired regions of the image.

Table 1: Measurement of CLIP similarity score $\cos(\cdot, \cdot)$ (↑) between the manipulated image and the CLIP representation $\phi_{\mathrm{CLIP}}(\cdot)$ of unsupervised direction $\alpha$.

Direction         cos(ϕ_CLIP(α), X_unsup)    cos(ϕ_CLIP(α), X_GD)
α^(10)            0.3295                     0.2701
α^(30)            0.3558                     0.3319
α^(50)            0.3687                     0.2924

Text-guided Manipulation by StyleCLIP GlobalDirection (Patashnik et al., 2021) is a representative method of text-driven image manipulation that provides an input-agnostic direction with a level of disentanglement. Intuitively, it computes the similarity between the input text and each style channel in the CLIP space (Radford et al., 2021) to find the channels that should be modified given the text. In order to compute the similarity between the text and a style channel, both should be encoded into the CLIP space. While the text guidance t is trivially encoded via the text encoder of CLIP as $\mathrm{CLIP}(t) \in \mathbb{R}^p$, style channels in StyleSpace S need additional pre-processing for the embedding. GlobalDirection proposes to embed the manipulation effect of the i-th style channel using the following mapping from StyleSpace S to the CLIP space:

$$\phi_{\mathrm{CLIP}}(e_i) := \mathbb{E}_{s \in S}\Big[\mathrm{CLIP}\big(G_s(s + e_i)\big) - \mathrm{CLIP}\big(G_s(s - e_i)\big)\Big], \tag{1}$$

where the manipulation vector $e_i \in \mathbb{R}^n$ is a zero vector except for the i-th entry. Adding the manipulation vector $e_i$ to the original style vector s indicates that only the i-th channel among the n channels in StyleSpace is manipulated. Note that $\phi_{\mathrm{CLIP}}(e_i)$ is also p-dimensional, since the CLIP encoder maps images generated by $G_s(\cdot)$ into a p-dimensional CLIP space. The above mapping in Eq. (1) is enumerated across all n channels in StyleSpace S to create a dictionary $D_{\mathrm{GlobalDirection}} := [\phi_{\mathrm{CLIP}}(e_1), \phi_{\mathrm{CLIP}}(e_2), \dots, \phi_{\mathrm{CLIP}}(e_n)] \in \mathbb{R}^{p \times n}$. Finally, with this dictionary, the manipulation direction $\hat{s} \in \mathbb{R}^n$ by GlobalDirection is given as the similarity score measured by the following equation:

$$\hat{s} = D_{\mathrm{GlobalDirection}}^{T}\, \mathrm{CLIP}(t).$$

This overall manipulation procedure is visualized in Fig. 2. In the following section (Sec. 3.2), we provide evidence for the hypothesis that the single-channel encoding strategy with $\phi_{\mathrm{CLIP}}(e_i)$ used to create the dictionary D is the major bottleneck causing the limited coverage issues that StyleCLIP suffers from.
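The dictionary construction and lookup described above can be sketched in a few lines. This is only an illustration under stated assumptions: `generator` and `clip_image` are hypothetical toy stand-ins (random nonlinear maps), not the real StyleGAN generator $G_s$ or the CLIP encoders; the expectation in Eq. (1) is approximated by a Monte Carlo average over randomly drawn style vectors; and the `magnitude` of the nonzero entry of $e_i$ is an assumed parameter, since the text does not fix it.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, img_dim = 16, 8, 32  # toy sizes: style channels, CLIP dim, "image" dim

# Hypothetical stand-ins (NOT real StyleGAN / CLIP weights).
A = rng.normal(size=(img_dim, n))
B = rng.normal(size=(p, img_dim))

def generator(s):
    """Toy G_s: style vectors of shape (..., n) -> 'images' of shape (..., img_dim)."""
    return np.tanh(s @ A.T)

def clip_image(x):
    """Toy CLIP image encoder: 'images' -> unit-norm embeddings of shape (..., p)."""
    z = x @ B.T
    return z / np.linalg.norm(z, axis=-1, keepdims=True)

def phi_clip(direction, styles):
    """Monte Carlo estimate of E_s[CLIP(G_s(s + d)) - CLIP(G_s(s - d))], as in Eq. (1)."""
    plus = clip_image(generator(styles + direction))
    minus = clip_image(generator(styles - direction))
    return (plus - minus).mean(axis=0)

def build_dictionary(styles, magnitude=1.0):
    """Stack phi_clip(e_i) for every channel i as columns of D, shape (p, n)."""
    n_channels = styles.shape[1]
    basis = np.eye(n_channels) * magnitude  # row i is e_i: zero except entry i
    return np.stack([phi_clip(basis[i], styles) for i in range(n_channels)], axis=1)

styles = rng.normal(size=(256, n))      # sampled style vectors s (assumption)
D = build_dictionary(styles)            # D_GlobalDirection, shape (p, n)

clip_text = rng.normal(size=p)          # placeholder for CLIP(t)
clip_text /= np.linalg.norm(clip_text)

s_hat = D.T @ clip_text                 # manipulation direction, shape (n,)
```

Each entry of `s_hat` is the inner product between the text embedding and one column of `D`, so every channel is scored on its own; nothing in this lookup models the joint effect of moving several channels at once, which is the limitation examined in Sec. 3.2.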

3.2. COVERAGE ANALYSIS OF STYLECLIP GLOBALDIRECTION METHOD

In this section, we experimentally verify that GlobalDirection has limited ability to find manipulation directions, as described in Fig. 1. First, we show that many edits obtained with unsupervised methods (Härkönen et al., 2020; Shen & Zhou, 2021) cannot be recovered by GlobalDirection. For example, in Fig. 1(a), we can observe that applying the 70-th GANspace direction manipulates the source image to become a man with a wide smile, showing that the pre-trained StyleGAN itself is capable of such manipulation. However, GlobalDirection (GD) in Fig. 1(a) consistently fails to recover the editing effect of the unsupervised direction, despite variation of the text input such as 'smiling man', 'grinning man', 'smirking man', and 'man with coy smile'. More quantitatively, we provide Tab. 1 to show that GlobalDirection (Patashnik et al., 2021) cannot effectively recover the directions found by unsupervised methods (Härkönen et al., 2020; Shen & Zhou, 2021). Scores in the table are the CLIP similarity between $\phi_{\mathrm{CLIP}}(\alpha) = \mathbb{E}_{s \in S}\big[\mathrm{CLIP}(G_s(s + \alpha)) - \mathrm{CLIP}(G_s(s - \alpha))\big]$ and the modified image created by either GlobalDirection ($X_{\mathrm{GD}}$; the construction of $X_{\mathrm{GD}}$ is explained below) or unsupervised methods ($X_{\mathrm{unsup}}$).
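To make the score in Tab. 1 concrete, here is a minimal sketch of a cosine similarity between an encoded direction and an image. It assumes the image enters the score through its CLIP image embedding, which is our reading of the notation rather than something stated explicitly; all vectors below are random placeholders, not real CLIP outputs, so the resulting numbers carry no meaning.

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity cos(u, v) between two 1-D vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

rng = np.random.default_rng(1)
p = 8  # toy CLIP dimension (assumption)

phi_alpha = rng.normal(size=p)                       # placeholder for phi_CLIP(alpha)
emb_unsup = phi_alpha + 0.5 * rng.normal(size=p)     # placeholder embedding of X_unsup
emb_gd = rng.normal(size=p)                          # placeholder embedding of X_GD

score_unsup = cosine(phi_alpha, emb_unsup)
score_gd = cosine(phi_alpha, emb_gd)
```

A higher score means the manipulated image lies closer, in CLIP space, to the change encoded by the direction; this is why the table marks the metric with (↑).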
Note that $\phi_{\mathrm{CLIP}}(\alpha)$ encodes the direction $\alpha$ into the CLIP embedding space as GlobalDirection does for a single
4 y y + v k t Z F 1 b u s 1 u 5 r l f p N H k c R T u A U z s G D K 6 j D H T S g C Q y G 8 A y v 8 O Y I 5 8 V 5 d z 4 W r Q U n n z m G P 3 A + f w B S q 4 3 W < / l a t e x i t > ⇡ < l a t e x i t s h a 1 _ b a s e 6 4 = " S v E w 9 j + G 6 / n N S Z e b h l v Q w R v l S i I = " > A A A B 6 H i c b V B N S 8 N A E J 3 U r 1 q / q h 6 9 L B b B U 0 m k q M e i F O w d Z J V 5 O K p C j 0 S 9 / 9 Q Y x S y O U h g m q d d d z E + N n V B n O B E 5 L v V R j Q t m Y D r F r q a Q R a j + b H z o l Z 1 Y Z k D B W t q Q h c / X 3 R E Y j r S d R Y D s j a k Z 6 2 Z u J / 3 n d 1 I T X f s Z l k h q U b L E o T A U x M Z l 9 T Q Z c I T N i Y g l l i t t b C R t R R Z m x 2 Z R s C N 7 y y 6 u k f V H 1 L q u 1 Z q 1 S v 8 n j K M I J n M I 5 e H A F d b i D B r S A A c I z v M K b 8 + i 8 O O / O x 6 K 1 4 O Q z x / A H z u c P f A e M v Q = = < / l a t e x i t > 0 < l a t e x i t s h a 1 _ b a s e 6 4 = " U P v c B 3 v L z w 3 T n n t t 3 J C E A F P n 1 m 4 = " > A A A B 6 3 i c b V B N S 8 N A E J 3 U r 1 q / q h 6 9 L B b B i y W R o h 6 L X j x W M G 2 h D W W z 3 b R L d z d h d y O U 0 L / g x Y M i X v 1 D 3 v w 3 b t o c t P X B w O O 9 G W b m h Q l n 2 r j u t 1 N a W 9 / Y 3 C p v V 3 Z 2 9 / Y P q o d H b R 2 n i l C f x D x W 3 R B r y p m k v m G G 0 2 6 i K B Y h p 5 1 w c p f 7 n S e q N I v l o 5 k m N B B 4 J F n E C D a 5 d N F P 2 K B a c + v u H G i V e A W p Q Y H W o P r V H 8 Y k F V Q a w r H W P c 9 N T J B h Z R j h d F b p p 5 o m m E z w i P Y s l V h Q H W T z W 2 f o z C p D F M X K l j R o e i 6 Z X k U C k z s = " > A A A B 7 X i c b V B N S w M x E J 2 t X 7 V + V T 1 6 C R b B U 9 0 t R T 0 W v X i s Y D + g X U o 2 z b a x 2 W R J s k J Z + h + 8 e F D E q / / H m / / G d L s H b X 0 w 8 H h v h p l 5 Q c y Z N q 7 7 7 R T W 1 j c 2 t 4 r b p Z 3 d v f 2 D 8 u F R W 8 t E E d o i k k v V D b C m n A n a M s x w 2 o 0 V x V H A a S e Y 3 M 7 9 z h N V m k n x Y K Y x 9 S M 8 E i x k B B s r t b 2 L W j 9 m g 3 L F r b o Z 0 C r x 
c l K B H M 1 B + a s / l C S J q D C E Y 6 1 7 n h s b P 8 X K M M L p r N R P N I 0 x m e A R 7 V k q c E S 1 n 2 b X z t C Z V Y Y o l M q W M C h T f 0 + k O N J 6 G g W 2 M 8 J m Z F k C y h 4 Q 2 u Q d A = " > A A A B 7 n i c b V B N S w M x E J 2 t X 7 V + r X r 0 E i y C F + t u K e q x 6 M V j B f s B 7 V K y a b Y N z W Z D k h X K 0 h / h x Y M i X v / + g f H j U 0 n G q G D Z Z L G L V C a h G w S U 2 D T c C O 4 l C G g U C 2 8 H 4 d u a 3 n 1 B p H s t H M 0 n Q j + h Q 8 p A z a q z 0 0 E t 4 v 1 x x q + 4 c Z J V 4 O a l A j k a / / N U b x C y N U B o m q N Z d z 0 2 M n 1 F l O B M 4 L f V S j Q l l Y z r E r q W S R q j 9 b H 7 q l J x Z Z U D C W N m S h s z V 3 x M Z j b S e R I H t j K g Z 6 W V v J v 7 n d V M T X v s Z l 0 l q U L L F o j A V x M R k 9 j c Z c I X M i I k l l C l u b y V s R B V l x q Z T s i F 4 y y + v k t Z F 1 b u s 1 u 5 r l f p N H k c R T u A U z s G D K 6 j O w d Z J V 5 O K p C j 0 S 9 / 9 Q Y x S y O U h g m q d d d z E + N n V B n O B E 5 L v V R j Q t m Y D r F r q a Q R a j + b H z o l Z 1 Y Z k D B W t q Q h c / X 3 R E Y j r S d R Y D s j a k Z 6 2 Z u J / 3 n d 1 I T X f s Z l k h q U b L E o T A U x M Z l 9 T Q Z c I T N i Y g l l i t t b C R t R R Z m x 2 Z R s C N 7 y y 6 u k f V H 1 L q u 1 Z q 1 S v e i 6 Z X k U C k z s = " > A A A B 7 X i c b V B N S w M x E J 2 t X 7 V + V T 1 6 C R b B U 9 0 t R T 0 W v X i s Y D + g X U o 2 z b a x 2 W R J s k J Z + h + 8 e F D E q / / H m / / G d L s H b X 0 w 8 H h v h p l 5 Q c y Z N q 7 7 7 R T W 1 j c 2 t 4 r b p Z 3 d v f 2 D 8 u F R W 8 t E E d o i k k v V D b C m n A n a M s x w 2 o 0 V x V H A a S e Y 3 M 7 9 z h N V m k n x Y K Y x 9 S M 8 E i x k B B s r t b 2 L W j 9 m g 3 L F r b o Z 0 C r x c l K B H M 1 B + a s / l C S J q D C E Y 6 1 7 n h s b P 8 X K M M L p r N R P N I 0 x m e A R 7 V k q c E S 1 n 2 b X z t C Z V Y Y o l M q W M C h T f 0 + k O N J 6 G g W 2 M 8 J m Z F k C y h 4 Q 2 u Q d A = " > A A A B 7 n i c b V B N S w M x E J 2 t X 7 V + r 
X r 0 E i y C F + t u K e q x 6 M V j B f s B 7 V K y a b Y N z W Z D k h X K 0 h / h x Y M i X v + g f H j U 0 n G q G D Z Z L G L V C a h G w S U 2 D T c C O 4 l C G g U C 2 8 H 4 d u a 3 n 1 B p H s t H M 0 n Q j + h Q 8 p A z a q z 0 0 E t 4 v 1 x x q + 4 c Z J V 4 O a l A j k a / / N U b x C y N U B o m q N Z d z 0 2 M n 1 F l O B M 4 L f V S j Q l l Y z r E r q W S R q j 9 b H 7 q l J x Z Z U D C W N m S h s z V 3 x M Z j b S e R I H t j K g Z 6 W V v J v 7 n d V M T X v s Z l 0 l q U L L F o j A V x M R k 9 j c Z c I X M i I k l l C l u b y V s R B V l x q Z T s i F 4 y y + v k t Z F d d z E + N n V B n O B E 5 L v V R j Q t m Y D r F r q a Q R a j + b H z o l Z 1 Y Z k D B W t q Q h c / X 3 R E Y j r S d R Y D s j a k Z 6 2 Z u J / 3 n d 1 I T X f s Z l k h q U b L E o T A U x M Z l 9 T Q Z c I T N i Y g l l i t t b C R t R R Z m x 2 Z R s C N 7 y y 6 u k f V H 1 L q u 1 Z q 1 S v + g f H j U 0 n G q G D Z Z L G L V C a h G w S U 2 D T c C O 4 l C G g U C 2 8 H 4 d u a 3 n 1 B p H s t H M 0 n Q j + h Q 8 p A z a q z 0 0 E t 4 v 1 x x q + 4 c Z J V 4 O a l A j k a / / N U b x C y N U B o m q N Z d z 0 2 M n 1 F l O B M 4 L f V S j Q l l Y z r E r q W S R q j 9 b H 7 q l J x Z Z U D C W N m S h s z V 3 x M Z j b S e R I H t j K g Z 6 W V v J v 7 n d V M T X v s Z l 0 l q U L L F o j A V x M R k 9 j c Z c I X M i I k l l C l u b y V s R B V l x q Z T s i F 4 y y + v k t Z F d d z E + N n V B n O B E 5 L v V R j Q t m Y D r F r q a Q R a j + b H z o l Z 1 Y Z k D B W t q Q h c / X 3 R E Y j r S d R Y D s j a k Z 6 2 Z u J / 3 n d 1 I T X f s Z l k h q U b L E o T A U x M Z l 9 T Q Z c I T N i Y g l l i t t b C R t R R Z m x 2 Z R s C N 7 y y 6 u k f V H 1 L q u 1 Z q 1 S v X v g = " > A A A C G n i c d V D L S s N A F J U V v q k s g W o m C k t a Q j c K L i r Y V m h C m E y m d D J g m J U E K / w / s a F I u E j X / j p K Q R Q M c z j X O x s Z F d I w P r X c y u r a + k Z + s C v b O V w / I o o Z h c M Q i f u s h Q R g N S U d S y c h t z A k K P E Z r 
i V X t h A s a h T d y E h M n Q M O Q D i h G U k l u b S i P l i E q g P v G I u q k d I D n i Q d q u m x P p + V l A H p q V s s G f p o a d W d D Q D a N u V W o Z s e p V q w J N p W Q o g Q X a b v H d i O c B C S U m C E h + q Y R S y d F X F L M y L R g J L E C I / R k P Q V D V F A h J P O T p v C E X c B B x U I J Z + p y R o C k S n n N n a n c t E / + q R M a D g p D e N E k h D P B w S B m U E s y g T z n B k k U Q Z h T t S v E I Q R l i r N g g r h + L P + l a u l n T q f V U l N f x J E H R + A Y l I E J q A J L k A b d A A G + A R P I M X U F l t k p y D s E P a B f a E G h q w = = < / l a t e x i t > CLIP (ei) < l a t e x i t s h a 1 _ b a s e 6 4 = " t v 5 N G + 0 y q g 2 g c H n  P e A e v 4 o v G X v g = " > A A A C G n i c d V D L S s N A F J 3 U V 6 2 v q k s 3 g 0 W o m 5 C k t a 2 7 Q j c K L i r Y V m h C m E y m 7 d D J g 5 m J U E K / w 4 2 / 4 s a F I u 7 E j X / j p K 1 Q R Q 8 M c z j 3 X O 6 9 x 4 s Z F d I w P r X c y u r a + k Z + s 7 C 1 v b O 7 V 9 w / 6 I o o 4 Z h 0 c M Q i f u s h Q R g N S U d S y c h t z A k K P E Z 6 3 r i V 1 X t 3 h A s a h T d y E h M n Q M O Q D i h G U k l u 0 b S 9 i P l i E q g P 2 v G I u q k d I D n i Q d q 6 u m x P p + V l A 3 H p q V s s G f p 5 o 2 a d W d D Q D a N u V W o Z s e p V q w J N p W Q o g Q X a b v H d 9 i O c B C S U m C E h + q Y R S y d F X F L M y L R g J 4 L E C I / R k P Q V D V F A h J P O T p v C E Q P h g k b h n Z z E x A n Q M K Q D i p F U k q v X b C 9 i v p g E 6 o N 2 P K J u a g d I j n i Q t m 6 u 2 9 N p a c W A W D x C L j 1 3 9 a J p X D Z q 1 o U F T c M 0 6 1 a l l h G r X r U q s K y U D E W w R N v V P 2 0 / w k l A Q o k Z E q J f N m P p p I h L i h m Z F u x E k B j h M R q S v q I h C o h w 0 v l 9 U 3 i m F B 8 O I q 5 e K O F c / d m R o k B k G y p n t r v 4 X c v E v 2 r 9 R A 4 a T k r D O J E k x I t B g 4 R B G c E s L O h T T r B k E 0 U Q 5 l T t C v E I c Y S l i r S g Q v i + F P 5 P u p Z R r h n V 2 2 q x a S z j y I M T c A p K o A z q o 
A m u Q B t 0 A A a P 4 B m 8 g j f t S X v R 3 r W P h T W n L X u O w Q q 0 2 R e K a q P a < / l a t e x i t > CLIP (↵i) < l a t e x i t s h a 1 _ b a s e 6 4 = " V o e h U a f f i 3 B S D E F 2 + w l Q a 2 O n u m 8 = " > A A A C H 3 i c d V D L S s N A F J 3 U V 6 2 v q E s 3 g 0 W o m 5 C m t a 2 7 Q j c K L i r Y W m h C m E y m 7 d D J g 5 m J U E L / x I 2 / 4 s a F I u K u f + O k r W B F D w x z O P d c 7 r 3 H i x k V 0 j R n W m 5 t f W N z K 7 9 d 2 N n d 2 z / Q D 4 + 6 I k o 4 J h 0 c s Y j 3 P C Q I o y H p S C o Z 6 c W c o M B j 5 N 4 b t 7 L 6 / Q P h g k b h n Z z E x A n Q M K Q D i p F U k q v X b C 9 i v p g E 6 o N 2 P K J u a g d I j n i Q t m 6 u 2 9 N p a c W A W D x C L j 1 3 9 a J p X D Z q 1 o U F T c M 0 6 1 a l l h G r X r U q s K y U D E W w R N v V P 2 0 / w k l A Q o k Z E q J f N m P p p I h L i h m Z F u x E k B j h M R q S v q I h C o h w 0 v l 9 U 3 i m F B 8 O I q 5 e K O F c / d m R o k B k G y p n t r v 4 X c v E v 2 r 9 R A 4 a T k r D O J E k x I t B g 4 R B G c E s L O h T T r B k E 0 U Q 5 l T t C v E I c Y S l i r S g Q v i + F P 5 P u p Z R r h n V 2 2 q x a S z j y I M T c A p K o A z q o A m u Q B t 0 A A a P 4 B m 8 g j f t S X v R 3 r W P h T W n L X u O w Q q 0 2 R e K a q P a < / l a t e x i t > CLIP (↵i) < l a t e x i t s h a 1 _ b a s e 6 4 = " V o e h U a f f i 3 B S D E F 2 + w l Q a 2 O n u m 8 = " > A A A C H 3 i c d V D L S s N A F J 3 U V 6 2 v q E s 3 g 0 W o m 5 C m t a 2 7 Q j c K L i r Y W m h C m E y m 7 d D J g 5 m J U E L / x I 2 / 4 s a F I u K u f + O k r W B F D w x z O P d c 7 r 3 H i x k V 0 j R n W m 5 t f W N z K 7 9 d 2 N n d 2 z / Q D 4 + 6 I k o 4 J h 0 c s Y j 3 P C Q I o y H p S C o Z 6 c W c o M B j 5 N 4 b t 7 L 6 / Q P h g k b h n Z z E x A n Q M K Q D i p F U k q v X b C 9 i v p g E 6 o N 2 P K J u a g d I j n i Q t m 6 u 2 9 N p a c W A W D x C L j 1 3 9 a J p X D Z q 1 o U F T c M 0 6 1 a l l h G r X r U q s K y U D E W w R N v V P 2 0 / w k l A Q o k Z E q J f N m P p p I h L i 
h m Z F u x E k B j h M R q S v q I h C o h w 0 v l 9 U 3 i m F B 8 O I q 5 e K O F c / d m R o k B k G y p n t r v 4 X c v E v 2 r 9 R A 4 a T k r D O J E k x I t B g 4 R B G c E s L O h T T r B k E 0 U Q 5 l T t C v E I c Y S l i r S g Q v i + F P 5 P u p Z R r h n V 2 2 q x a S z j y I M T c A p K o A z q o A m u Q B t 0 A A a P 4 B m 8 g j f t S X v R 3 r W P h T W n L X u O w Q q 0 2 R e K a q P a < / l a t e x i t > CLIP (↵i) < l a t e x i t s h a 1 _ b a s e 6 4 = " V o e h U a f f i 3 B S D E F 2 + w l Q a 2 O n u m 8 = " > A A A C H 3 i c d V D L S s N A F J 3 U V 6 2 v q E s 3 g 0 W o m 5 C m t a 2 7 Q j c K L i r Y W m h C m E y m 7 d D J g 5 m J U E L / x I 2 / 4 s a F I u K u f + O k r W B F D w x z O P d c 7 r 3 H i x k V 0 j R n W m 5 t f W N z K 7 9 d 2 N n d 2 z / Q D 4 + 6 I k o 4 J h 0 c s Y j 3 P C Q I o y H p S C o Z 6 c W c o M B j 5 N 4 b t 7 L 6 / Q P h g k b h n Z z E x A n Q M K Q D i p F U k q v X b C 9 i v p g E 6 o N 2 P K J u a g d I j n i Q t m 6 u 2 9 N p a c W A W D x C L j 1 3 9 a J p X D Z q 1 o U F T c M 0 6 1 a l l h G r X r U q s K y U D E W w R N v V P 2 0 / w k l A Q o k Z E q J f N m P p p I h L i h m Z F u x E k B j h M R q S v q I h C o h w 0 v l 9 U 3 i m F B 8 O I q 5 e K O F c / d m R o k B k G y p n t r v 4 X c v E v 2 r 9 R A 4 a T k r D O J E k x I t B g 4 R B G c E s L O h T T r B k E 0 U Q 5 l T t C v E I c Y S l i r S g Q v i + F P 5 P u p Z R r h n V 2 2 q x a S z j y I M T c A p K o A z q o A m u Q B t 0 A A a P 4 B m 8 g j f t S X v R 3 r W P h T W n L X u O w Q q 0 2 R e K a q P a < / l a t e x i t > CLIP (↵i) < l a t e x i t s h a 1 _ b a s e 6 4 = " U P v c B 3 v L z w 3 T n n t t 3 J channel. From the table, we can confirm that GlobalDirection does not recover the similarity in CLIP space that the unsupervised methods have. 
C E A F P n 1 m 4 = " > A A A B 6 3 i c b V B N S 8 N A E J 3 U r 1 q / q h 6 9 L B b B i y W R o h 6 L X j x W M G 2 h D W W z 3 b R L d z d h d y O U 0 L / g x Y M i X v 1 D 3 v w 3 b t o c t P X B w O O 9 G W b m h Q l n 2 r j u t 1 N a W 9 / Y 3 C p v V 3 Z 2 9 / Y P q o d H b R 2 n i l C f x D x W 3 R B e i 6 Z X k U C k z s = " > A A A B 7 X i c b V B N S w M x E J 2 t X 7 V + V T 1 6 C R b B U 9 0 t R T 0 W v X i s Y D + g X U o 2 z b a x 2 W R J s k J Z + h + 8 e F D E q / / H m / / G d L s H b X 0 w 8 H h v h p l 5 Q c y Z N q 7 7 7 R T W 1 j c 2 t 4 r b p Z 3 d v f 2 D 8 u F R W 8 t E E d o i k k v V D b C m n A n a M s x w 2 o 0 V x V H A a S e Y 3 M 7 9 z h N V m k n x Y K Y x 9 S M 8 E i x k B B s r t b 2 L W j 9 m g 3 L F r b o Z 0 C r x c l K B H M 1 B + a s / l C S J q D C E Y 6 1 7 n h s b P 8 X K M M L p r N R P N I 0 x m e A R 7 V k q c E S 1 n 2 b X z t C Z V Y Y o l M q W M C h T f 0 + k O N J 6 G g W 2 M 8 J m Z F k C y h 4 Q 2 u Q d A = " > A A A B 7 n i c b V B N S w M x E J 2 t X 7 V + r X r 0 E i y C F + t u K e q x 6 M V j B f s B 7 V K y a b Y N z W Z D k h X K 0 h / h x Y M i X v / + g f H j U 0 n G q G D Z Z L G L V C a h G w S U 2 D T c C O 4 l C G g U C 2 8 H 4 d u a 3 n 1 B p H s t H M 0 n Q j + h Q 8 p A z a q z 0 0 E t 4 v 1 x x q + 4 c Z J V 4 O a l A j k a / / N U b x C y N U B o m q N Z d z 0 2 M n 1 F l O B M 4 L f V S j Q l l Y z r E r q W S R q j 9 b H 7 q l J x Z Z U D C W N m S h s z V 3 x M Z j b S e R I H t j K g Z 6 W V v J v 7 n d V M T X v s Z l 0 l q U L L F o j A V x M R k 9 j c Z c I X M i I k l l C l u b y V s R B V l x q Z T s i F 4 y y + v k t Z F 1 b u s 1 u 5 r l f p N H k c R T u A U z s G D K 6 j D H T S g C Q y G 8 A x W M S q E 1 C N g k t s G W 4 E d h K F N A o E P g T j 2 5 n / 8 I R K 8 1 j e m 0 m C f k S H k o e c U W O l p t s v V 9 y q O w d Z J V 5 O K p C j 0 S 9 / 9 Q Y x S y O U h g m q d d d z E + N n V B n O B E 5 L v V R j Q t m Y D r F r q a Q R a j + b H z o l Z 1 Y Z k D B W t q Q h c / X 
Here we describe in detail how we generate $X_{\text{unsup}}$ and $X_{\text{GD}}$ for Tab. 1. First, the source image $X = G_s(s)$ is manipulated by each unsupervised direction $\alpha \in \mathbb{R}^n$ into $X_{\text{unsup}} = G_s(s + \alpha)$. Instead of naïvely using the full direction $\alpha \in \mathbb{R}^n$, we only use the $k$ channels with the largest magnitude among the $n$ channels in $\alpha$, which reduces the entanglement that naturally comes with directions found by unsupervised methods (see Appendix C for the effect of using a limited number of channels). We denote such sparsified directions as $\alpha^{(k)} \in \mathbb{R}^n$ and use them in the experiment of Tab. 1. Then, to generate $X_{\text{GD}} = G_s(s + \hat{\alpha})$, we compute $\hat{\alpha} = D_{\text{GlobalDirection}}^T \text{CLIP}(t) \in \mathbb{R}^n$ as in Eq. (2). However, since there is no text corresponding to $\alpha$, and our objective in this experiment is to find a direction $\hat{\alpha}$ that may recover $X_{\text{unsup}}$ by GlobalDirection, we replace $\text{CLIP}(t)$ with

$$\phi_{\text{CLIP}}(\alpha^{(k)}) = \mathbb{E}_{s \in \mathcal{S}} \left[ \text{CLIP}\big(G_s(s + \alpha^{(k)})\big) - \text{CLIP}\big(G_s(s - \alpha^{(k)})\big) \right].$$

Since $\phi_{\text{CLIP}}(\alpha^{(k)})$ is the manipulation effect of $\alpha^{(k)}$ encoded into CLIP space, it serves as a substitute for the CLIP-encoded text guidance $\text{CLIP}(t)$. Finally, we compute the similarity score for 30 instances of source images and 1024 directions from SeFa and GANspace, and report the average in Tab. 1.

Second, in Fig. 1(b), we provide several instances where the GlobalDirection of StyleCLIP fails to manipulate an image given randomly selected text. For additional failure cases of the standard method, please refer to Appendix A, where we describe how we choose random texts and show manipulated images for an extensive list of random texts.
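The procedure above can be sketched in a few lines of NumPy. This is only an illustrative sketch: `sparsify_top_k` and `phi_clip` are hypothetical helper names, and the `generate` and `encode` callables are linear placeholders standing in for the StyleGAN generator $G_s$ and the CLIP image encoder, which are not loaded here. The dictionary `D` is random and stands in for a real dictionary.

```python
import numpy as np

def sparsify_top_k(alpha, k):
    """Keep the k largest-magnitude channels of alpha; zero out the rest."""
    alpha = np.asarray(alpha, dtype=float)
    sparse = np.zeros_like(alpha)
    if k > 0:
        keep = np.argsort(np.abs(alpha))[-k:]  # indices of the k largest |alpha_i|
        sparse[keep] = alpha[keep]
    return sparse

def phi_clip(direction, styles, generate, encode):
    """Sample average of encode(G(s + d)) - encode(G(s - d)) over style codes s."""
    diffs = [encode(generate(s + direction)) - encode(generate(s - direction))
             for s in styles]
    return np.mean(diffs, axis=0)

# Hypothetical stand-ins so the sketch runs without StyleGAN or CLIP.
rng = np.random.default_rng(0)
n, p = 6, 3                        # number of style channels, embedding dimension
W = rng.normal(size=(p, n))
generate = lambda s: s             # placeholder for G_s
encode = lambda x: W @ x           # placeholder for the CLIP image encoder

alpha = np.array([0.1, -3.0, 2.0, 0.5, 0.0, -0.2])
alpha_k = sparsify_top_k(alpha, k=2)        # only -3.0 and 2.0 survive
styles = rng.normal(size=(30, n))           # 30 source style codes
guidance = phi_clip(alpha_k, styles, generate, encode)

D = rng.normal(size=(p, n))                 # placeholder dictionary
alpha_hat = D.T @ guidance                  # estimated direction, same form as Eq. (2)
```

With real models, `guidance` would replace $\text{CLIP}(t)$, and `alpha_hat` would be compared against `alpha_k` through the images they produce.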
Regarding the phenomena found in the above experiments, we hypothesize that the crude assumption of GlobalDirection, namely that manipulation based solely on a single channel can fully represent the manipulative ability of StyleGAN, leads to the limited coverage issue. To disprove this assumption, in Fig. 3(a) we demonstrate that the single-channel manipulation framework of GlobalDirection (using $e_i$) leads to improper construction of $\phi_{\text{CLIP}}(e_i)$ in the dictionary $D_{\text{GlobalDirection}}$. Then in Fig. 3(b), we show an example where editing two channels has a completely different effect from manipulating each of the channels individually, which demonstrates the need for multi-channel manipulation.

In Fig. 3(a), we show that $\phi_{\text{CLIP}}(e_i) = \mathbb{E}\left[\text{CLIP}\big(G_s(s + e_i)\big) - \text{CLIP}\big(G_s(s - e_i)\big)\right]$, which forms the dictionary $D_{\text{GlobalDirection}}$, is not a robust representative of the $i$-th style channel, since the per-sample difference $\text{CLIP}\big(G_s(s + e_i)\big) - \text{CLIP}\big(G_s(s - e_i)\big)$ is inconsistent across samples. The left side of Fig. 3(a) shows the distribution of these per-sample differences together with their average over the samples, $\phi_{\text{CLIP}}(e_i)$. As the samples show inconsistent angles in polar coordinates when modified by a single-channel manipulation $e_i$, we conclude that $\phi_{\text{CLIP}}(e_i)$ may not be a trustworthy representative of the $i$-th channel in StyleSpace. Thus the basic assumption of GlobalDirection, which expects a single-channel manipulation using $e_i$ to be semantically consistent in CLIP space on any image, in fact has numerous counter-examples.

As discussed in Sec. 3.2, each style channel taken individually, rather than as part of a whole, may not be meaningfully embedded in CLIP space. To address this issue, we propose a dictionary learning approach to find a robust representation of styles, each of which may correspond to a set of channels in StyleSpace $\mathcal{S}$.
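One simple way to quantify the sample-wise inconsistency discussed above is to compare every per-sample difference vector with the average direction. The sketch below is an assumption-laden illustration, not the measurement used in Fig. 3(a): `direction_consistency` is a hypothetical name, and the difference vectors are synthetic rather than CLIP embeddings.

```python
import numpy as np

def direction_consistency(diffs):
    """Mean cosine similarity between each per-sample difference and their average."""
    diffs = np.asarray(diffs, dtype=float)
    mean = diffs.mean(axis=0)
    unit_mean = mean / np.linalg.norm(mean)
    unit = diffs / np.linalg.norm(diffs, axis=1, keepdims=True)
    return float(np.mean(unit @ unit_mean))

rng = np.random.default_rng(0)
base = rng.normal(size=16)

# Synthetic "consistent" case: every sample moves along almost the same direction.
consistent = base + 0.05 * rng.normal(size=(200, 16))
# Synthetic "inconsistent" case: every sample moves along an unrelated direction.
inconsistent = rng.normal(size=(200, 16))

score_consistent = direction_consistency(consistent)
score_inconsistent = direction_consistency(inconsistent)
```

A score near 1 means the average is a faithful summary of the samples; a score near 0 means the average hides large per-sample disagreement, which is the failure mode attributed to $\phi_{\text{CLIP}}(e_i)$.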
Ideally, given a text input $t$ and its CLIP encoding $\text{CLIP}(t) \in \mathbb{R}^p$, we would like to construct a dictionary $D \in \mathbb{R}^{p \times n}$ which finds a manipulation direction $\hat{s} = D^T \text{CLIP}(t) \in \mathbb{R}^n$ as in Eq. (2). The direction $\hat{s}$ is expected to manipulate an image into $G_s(s + \hat{s})$, which faithfully represents $t$. Let us for now assume that the 'ground truth' manipulation direction $\hat{s}_t \in \mathbb{R}^n$ corresponding to the driving text $t$ is known. Then our estimated direction $\hat{s} = D^T \text{CLIP}(t)$ should generate an image $X_{\text{edited}} = G_s(s + D^T \text{CLIP}(t))$ close to $G_s(s + \hat{s}_t)$. This leads to the objective

$$\underset{D}{\text{minimize}} \; \left\| D^T \text{CLIP}(t) - \hat{s}_t \right\|_2^2 \qquad (3)$$

which we name the ideal dictionary learning problem, since the ground truth direction $\hat{s}_t$ for a driving text $t$ is unknown in practice. To avoid this unrealistic assumption, we substitute $\hat{s}_t$ with the known directions $\alpha \in \mathbb{R}^n$ derived from unsupervised methods (Härkönen et al., 2020; Shen & Zhou, 2021). $\text{CLIP}(t)$ is also replaced by $\phi_{\text{CLIP}}(\alpha)$ (see Eq. (1) for the definition of $\phi_{\text{CLIP}}(\cdot)$), since we would like to use $\alpha$ itself as the driving guidance rather than finding a corresponding text $t$, which would require labor-intensive labeling. Then, Eq. (3) becomes

$$\underset{D}{\text{minimize}} \; \sum_{\forall \alpha} \left\| D^T \phi_{\text{CLIP}}(\alpha) - \alpha \right\|_2^2. \qquad (4)$$

We construct the set of known directions $\alpha$ in Eq. (4) by using all 512 directions found by GANspace and the directions with the top 80 eigenvalues out of 512 from SeFa (Härkönen et al., 2020; Shen & Zhou, 2021). Instead of naïvely using the unsupervised direction $\alpha \in \mathbb{R}^n$, we prune the directions to activate only $k$ channels among the $n$ channels, zeroing out the other $n - k$ channels that have small magnitudes. Such pruning is represented by $\alpha^{(k)} \in \mathbb{R}^n$, where we use $k = 10, 30, 50$ in this paper (the effect of $k$ is described in Tab. 5 in the appendix). The reason for pruning the number of activated channels is that unsupervised methods show instability in terms of maintaining ⋯

To help close the gap between Eq. (3) and Eq. (4), we introduce an additional loss term $\left\| D\alpha^{(k)} - \phi_{\text{CLIP}}(\alpha^{(k)}) \right\|_2^2$ to update the dictionary $D \in \mathbb{R}^{p \times n}$. This additional term comes from the observation in Sec. 3.2 that multi-channel manipulation should be considered. Unlike GlobalDirection, which directly encodes $\phi_{\text{CLIP}}(e_i)$ for all $n$ channels in StyleSpace, we aim to learn the multi-channel manipulation by $\alpha^{(k)}$, whose editing effect is mapped from StyleSpace to CLIP space by $\phi_{\text{CLIP}}(\cdot)$. The main obstacle in this objective is identifying which channel is responsible for which part of the change induced by the total of $k$ channels in the given direction $\alpha^{(k)}$. Here, we emphasize that augmenting the same direction $\alpha$ into $\alpha^{(10)}, \alpha^{(30)}, \alpha^{(50)}$ can encourage the dictionary learning process to identify a disentangled editing effect for each channel, as explained below.

Based on the concept figure in Fig. 4, we explain how the additional loss term encourages disentangled learning of the dictionary $D$. We present a simplified version where $k = 1, 2, 3$, creating three augmentations $\alpha^{(1)}, \alpha^{(2)}, \alpha^{(3)}$ of the direction $\alpha$. Suppose manipulation using $\alpha^{(3)} = [s_1, s_2, s_3]^T$ modifies the source image to have 1) thick eyebrows, 2) pink lipstick and 3) glowing skin. Further suppose manipulation by $\alpha^{(2)} = [s_1, 0, s_3]^T$ modifies 1) thick eyebrows and 2) pink lipstick. Finally, manipulation using $\alpha^{(1)} = [s_1, 0, 0]^T$ modifies 1) thick eyebrows only. If the loss term $\left\| D\alpha^{(k)} - \phi_{\text{CLIP}}(\alpha^{(k)}) \right\|_2^2$ is perfectly optimized to satisfy $D\alpha^{(k)} = \phi_{\text{CLIP}}(\alpha^{(k)})$, then, writing $d_i$ for the $i$-th column of $D$, we have $s_1 d_1 \approx \text{CLIP}(\text{thick eyebrows})$, $s_1 d_1 + s_3 d_3 \approx \text{CLIP}(\text{thick eyebrows \& pink lipstick})$ and $s_1 d_1 + s_2 d_2 + s_3 d_3 \approx \text{CLIP}(\text{thick eyebrows \& pink lipstick \& glowing skin})$. Therefore, the linear combination on the sparsified directions helps to specifically link 'thick eyebrows' to $d_1$, 'pink lipstick' to $d_3$, and 'glowing skin' to $d_2$.
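The three-channel example can be checked numerically. The toy sketch below assumes a noise-free linear relation $D\alpha^{(k)} = \phi_{\text{CLIP}}(\alpha^{(k)})$ with a made-up `D_true`; it only illustrates why several sparsity levels of the same direction make the individual columns identifiable, and it is not the training procedure of the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
p = 4                                   # toy embedding dimension
D_true = rng.normal(size=(p, 3))        # made-up columns d_1, d_2, d_3
s1, s2, s3 = 2.0, -1.0, 0.5

# Columns are the augmentations alpha^(1), alpha^(2), alpha^(3).
A = np.array([[s1,  s1, s1],
              [0.0, 0.0, s2],
              [0.0, s3, s3]])
Phi = D_true @ A                        # idealized phi_CLIP of each augmentation

# With all three augmentations the linear system D A = Phi has a unique solution.
D_rec = Phi @ np.linalg.inv(A)

# With alpha^(3) alone, many dictionaries fit; the minimum-norm one is rank 1.
a3 = A[:, [2]]
D_single = Phi[:, [2]] @ np.linalg.pinv(a3)

recovered_all = np.allclose(D_rec, D_true)
fits_single = np.allclose(D_single @ a3, Phi[:, [2]])
recovered_single = np.allclose(D_single, D_true)
```

Here `D_single` reproduces the combined effect of $\alpha^{(3)}$ yet does not separate the three columns, whereas `D_rec` matches `D_true` column by column.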
Hence our method can obtain highly disentangled and diverse directions, owing to the enriched coverage over manipulation in StyleGAN that comes from dictionary learning based on multi-channel manipulation. Moreover, even though our $\alpha$ comes from the unsupervised methods, our method learns the individual roles of $s_1$, $s_2$ and $s_3$ from the single unsupervised direction $\alpha = [s_1, s_2, s_3]^T$. This leads to more diverse and disentangled results than those of unsupervised methods, as shown in Tab. 3. Finally, combining these two ingredients yields our dictionary learning problem over all known directions $\alpha^{(k)}$:

$$\underset{D}{\text{minimize}} \; \sum_{\forall \alpha^{(k)}} \left\| D^T \phi_{\text{CLIP}}(\alpha^{(k)}) - \alpha^{(k)} \right\|_2^2 + \lambda \left\| D\alpha^{(k)} - \phi_{\text{CLIP}}(\alpha^{(k)}) \right\|_2^2 \qquad (5)$$

where $\lambda$ is a tunable hyper-parameter, set to 0.01 in all our experiments below.
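A minimal sketch of minimizing an objective of the form of Eq. (5) with plain gradient descent follows. The optimizer, the initialization and the step-size rule are assumptions made for this illustration; the function names are hypothetical and the data are random placeholders rather than real directions or CLIP embeddings.

```python
import numpy as np

def dictionary_loss(D, Phi, A, lam):
    """Sum over directions of ||D^T phi - a||^2 + lam * ||D a - phi||^2."""
    recon = D.T @ Phi - A        # (n, m) residual of the reconstructed directions
    embed = D @ A - Phi          # (p, m) residual of the embedded directions
    return float(np.sum(recon ** 2) + lam * np.sum(embed ** 2))

def fit_dictionary(Phi, A, lam=0.01, steps=500):
    """Gradient descent on the quadratic objective, starting from D = 0."""
    p, n = Phi.shape[0], A.shape[0]
    D = np.zeros((p, n))
    # Step size 1/L, where L bounds the curvature of the quadratic objective.
    L = 2.0 * (np.linalg.norm(Phi, 2) ** 2 + lam * np.linalg.norm(A, 2) ** 2)
    for _ in range(steps):
        grad = 2.0 * Phi @ (D.T @ Phi - A).T + 2.0 * lam * (D @ A - Phi) @ A.T
        D -= grad / L
    return D

rng = np.random.default_rng(0)
p, n, m = 8, 12, 40                  # embedding dim, style channels, directions
A = rng.normal(size=(n, m))          # columns: placeholder sparsified directions
Phi = rng.normal(size=(p, m))        # columns: placeholder phi_CLIP embeddings

loss_init = dictionary_loss(np.zeros((p, n)), Phi, A, lam=0.01)
D = fit_dictionary(Phi, A, lam=0.01)
loss_final = dictionary_loss(D, Phi, A, lam=0.01)
```

At inference time, a learned `D` would be applied as `D.T @ clip_text_embedding` to obtain a manipulation direction, as in Eq. (2).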

5. EXPERIMENTS

We evaluate the proposed method on StyleGAN2 (Karras et al., 2020a). We use StyleGAN2-ADA pretrained on multiple datasets, including FFHQ (Karras et al., 2019), LSUN (Yu et al., 2015) Car and Church, and AFHQ (Choi et al., 2020) Dog and Cat. In this section, we show the widened StyleGAN manipulation coverage of our method compared to a text-guided method, GlobalDirection of StyleCLIP (Patashnik et al., 2021), mainly on the FFHQ pre-trained model. For additional details on the experimental design, please refer to Appendix E.

Quantitative Results

Table 3: Cosine similarity $R_{\text{CLIP}}$ (Eq. (7); ↑) measured in CLIP space. For the unsupervised method, which does not depend on text guidance, we choose the text-related unsupervised direction $\alpha$ that has the highest similarity value $\cos(\phi_{\text{CLIP}}(\alpha), \text{CLIP}(t))$ given text $t$. We also present the similarity value of our method with hyper-parameter $\lambda = 0$ as an ablation study on the second term of Eq. (5).

Methods: Unsupervised, GlobalDirection, Ours (λ = 0), Ours (λ = 0.⋯)

We measure the superiority of our method in terms of recovering the unsupervised directions $\alpha$ and the text guidance $t$. The metrics are defined in StyleSpace $\mathcal{S}$ for evaluation over unsupervised directions and in CLIP space for evaluation over text guidance. Multi2One recovers the unsupervised directions with maximum manipulative ability and minimum entanglement; to demonstrate this, we define two metrics as follows.

• StyleSpace metric for unsupervised directions (Tab. 2): We employ the following metric:

$$\text{MSE}(\alpha, \hat{\alpha}) := \mathbb{E}_{\alpha \sim \mathcal{A}} \left\| \hat{\alpha} - \alpha \right\|_2^2 \qquad (6)$$

where $\alpha$ is drawn from a set of unsupervised directions $\mathcal{A}$. Note that the ground truth direction $\alpha$ in this metric is $\alpha^{(100)}$, where the top-100 channels of the unsupervised direction are modified while the other $n - 100$ channels are zero entries. Therefore the reconstruction of such a direction, given by $\hat{\alpha} = D^T \phi_{\text{CLIP}}(\alpha^{(100)})$, is expected to have 100 style channels whose values are similar to those of the non-zero entries in $\alpha^{(100)}$, while the other $n - 100$ styles are close to 0. $\text{MSE}(\alpha, \hat{\alpha})|_{\neq 0}$ denotes the difference between $\alpha$ and $\hat{\alpha}$ measured over the 100 non-zero style channels, which represents the ability of the given dictionary $D$ to find the precise manipulation direction. On the other hand, $\text{MSE}(\alpha, \hat{\alpha})|_{= 0}$ is the distance between $\alpha$ and $\hat{\alpha}$ measured within the $n - 100$ zero entries. This represents the level of disentanglement, since the zero entries of $\alpha$ are not expected to be modified.

• CLIP space metric for text guidance (Tab. 3): We measure the similarity between the edited image and the text guidance in CLIP space as follows:

$$R_{\text{CLIP}}(t, X_{\text{edited}}) := \mathbb{E}_{t \sim \mathcal{T}} \left[ \cos\big(\text{CLIP}(t), \text{CLIP}(X_{\text{edited}})\big) \right]. \qquad (7)$$

We denote by $\mathcal{T}$ the set of text prompts that describes the pre-trained domain of StyleGAN, namely FFHQ; its construction is described in Appendix A.

Tab. 2 shows that Multi2One consistently outperforms GlobalDirection in terms of the MSE score between the unsupervised direction $\alpha$ and the estimated direction $\hat{\alpha}$. In particular, Multi2One has a lower $\text{MSE}(\alpha, \hat{\alpha})|_{= 0}$ than GlobalDirection, indicating that our method shows less entanglement. Moreover, Tab. 3 shows that the cosine similarity between the given text guidance and the manipulated image is always the highest using our method.

Most importantly, we emphasize that our method can find manipulation directions that could not be found by unsupervised methods. We support this claim using 57 text prompts ($t \sim \mathcal{T}$) as guidance to manipulate the images. Since we aim to find the unsupervised direction that could edit the image to match $t$, we select the $\alpha$ whose similarity $\cos(\phi_{\text{CLIP}}(\alpha), \text{CLIP}(t))$ is the largest, then manipulate the image to produce $X_{\text{unsup}} = G_s(s + \alpha)$. In Tab. 3, Multi2One achieves a better $R_{\text{CLIP}}(t, X_{\text{edited}})$ score than unsupervised methods. Therefore, we claim that, given diverse text inputs, our method successfully learns to find expansive editing directions which could not be found by unsupervised methods. Furthermore, Tab. 3 reports the $R_{\text{CLIP}}(t, X_{\text{edited}})$ score with an ablation on the second term of Eq. (5). For visual examples see Appendix G.
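The two metrics can be written compactly as follows. This is a sketch under assumptions: the function names are hypothetical, the restricted errors are computed as sums of squared differences for a single direction (whether the paper additionally normalizes by the number of entries is not stated here), and the embeddings in the example are tiny hand-made vectors rather than CLIP outputs.

```python
import numpy as np

def split_squared_error(alpha, alpha_hat):
    """Squared error restricted to the non-zero and to the zero entries of alpha."""
    alpha = np.asarray(alpha, dtype=float)
    alpha_hat = np.asarray(alpha_hat, dtype=float)
    nonzero = alpha != 0
    err = (alpha_hat - alpha) ** 2
    return float(err[nonzero].sum()), float(err[~nonzero].sum())

def cosine(u, v):
    """Cosine similarity between two vectors."""
    u, v = np.asarray(u, dtype=float), np.asarray(v, dtype=float)
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def r_clip(text_embs, image_embs):
    """Average cosine similarity over (text, edited image) embedding pairs."""
    return float(np.mean([cosine(t, x) for t, x in zip(text_embs, image_embs)]))

# Sparse ground-truth direction and an imperfect reconstruction of it.
alpha = [1.0, 0.0, -2.0, 0.0]
alpha_hat = [0.5, 0.1, -2.0, -0.2]
err_nonzero, err_zero = split_squared_error(alpha, alpha_hat)

# Two toy pairs: one perfectly aligned, one orthogonal.
score = r_clip([[1.0, 0.0], [1.0, 0.0]], [[1.0, 0.0], [0.0, 1.0]])
```

In this toy case `err_nonzero` reflects how precisely the active channels are recovered, while `err_zero` reflects leakage into channels that should stay untouched.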

Qualitative Results

We demonstrate the effectiveness of our method compared to the state-of-the-art method, GlobalDirection. We conducted all experiments using the pretrained models and the precomputed sets of CLIP embeddings for StyleCLIP, which are readily provided for FFHQ, AFHQ Cat and AFHQ Dog. We computed the CLIP embeddings of the StyleSpace of LSUN Church and Car using the same configuration as GlobalDirection. All results are compared under the same conditions, where manipulation strength and disentanglement levels are set to be equal. More specifically, we modify the same number of channels for both methods, and the magnitude of change in StyleSpace is always fixed at 10.

The manipulated images show that GlobalDirection fails to represent some of the most basic text, such as 'Young' for FFHQ. On the other hand, our method successfully manipulates the images not only for basic text guidance but also for advanced semantics such as 'Little mermaid' and 'Joker smile' for FFHQ. We emphasize that although the dictionary of our method is learned from the known directions of unsupervised approaches (Shen & Zhou, 2021; Härkönen et al., 2020), the manipulation results show that our learned dictionary can adapt to previously unseen combinations of semantics, such as red hair, pale skin, and big eyes to represent 'Little Mermaid', and unnatural smiles with red lipstick and a pale face to represent 'Joker smile'.

Here we show that our method successfully discovers directions that do not exist among the unsupervised directions. To be more specific, we manipulate the image using the unsupervised direction $\alpha$ whose CLIP representation, $\phi_{\text{CLIP}}(\alpha)$, is the most similar to the text 'Little Mermaid' and 'Joker smile'. From the manipulated images, we observe that the manipulation result is hardly plausible compared to the successful results in Fig. 5. Based on these examples, we emphasize that even though the dictionary learning process relies on the unsupervised methods, Multi2One demonstrates a wider variety of manipulation results by exploiting the versatility of text.

[Figure residue: panel label 'Little mermaid (similarity 0.⋯)']

Due to space constraints, we show additional results of our method compared to GlobalDirection on the AFHQ and LSUN datasets in Appendix D.

6. CONCLUSION

Text-guided image manipulation has been largely preferred over unsupervised methods for finding manipulation directions, due to its ability to find a wide variety of directions corresponding to flexible text input. However, we have shown that the state-of-the-art method for text-guided image manipulation has limited coverage over the possible editing directions of StyleGAN, and that this problem arises from the simple assumption that a single-channel manipulation can fully represent the manipulation ability of StyleGAN. To overcome this issue, we have proposed a dictionary learning framework that jointly embeds into CLIP space the interactive and collective editing effect of modifying multiple style channels. Our method has been shown to be superior to other text-guided image manipulation methods in terms of manipulative ability and disentanglement.

REPRODUCIBILITY STATEMENT

Our code is open sourced here.

A MORE EXAMPLES ON FIG. 1(B)

In this section, we show extensive results on the limited coverage of GlobalDirection over a wide variety of text guidance. The experiment is conducted on FFHQ-pretrained StyleGAN-ADA, and the text guidance consists of adjective-noun pairs extracted from the Visual Semantic Ontology (Jou et al., 2015), from which we extract the nouns that describe a human face. For example, the adjective-noun pairs contain human face descriptions such as 'Handsome smile', 'Bad hair', and 'Stupid face'. There are 57 such adjective-noun pairs. We manipulate the original image on the leftmost side with GlobalDirection and Multi2One, using the same set of text descriptions and an equal configuration for both methods (manipulated channels 5, manipulation strength 5). The result in Fig. 7 shows that StyleCLIP GlobalDirection consistently fails to represent the given text guidance faithfully, manifesting its limited capacity to exploit the flexibility of text and hence the need for Multi2One to overcome the coverage issue.

B MORE DETAILS ON FIG. 3(A)

Both GlobalDirection and Multi2One rely on multiple image samples to encode the averaged difference caused by manipulating a sample in the positive and negative directions. The difference is that, unlike GlobalDirection, which relies on single-channel manipulation, Multi2One encodes the change in the image caused by image-agnostic directions found by unsupervised methods (Härkönen et al., 2020; Shen & Zhou, 2021). Single-channel manipulation is not image-agnostic, since the change caused by moving a single channel may not be applicable to certain source images used as samples. For the unconditional FFHQ-pretrained StyleGAN-ADA, we have observed that a style channel known to make a person's hair grey fails to modify half of the samples. On the other hand, the image-agnostic directions found by unsupervised methods rarely fail to manipulate a given image to become older with grey hair. As a consequence, the CLIP embeddings of the difference caused by single-channel manipulation may vary greatly from sample to sample, while image-agnostic directions yield consistent CLIP-embedding directions owing to the successful manipulation of any given sample image. We visualize the direction of each sample when a single channel is modified, and when 100 or 200 channels from unsupervised directions are modified. More specifically, for the single-channel manipulation of GlobalDirection the samples are $E(G_s(s + s_i)) - E(G_s(s - s_i))$, and for multiple-channel manipulation the samples are $E(G_s(s + \alpha)) - E(G_s(s - \alpha))$, where $s \sim S$. The visualization is conducted in a manner similar to the work of Wang & Isola (2020). First, we embed the p = 512 dimensional CLIP vectors into two dimensions using t-SNE (Van der Maaten & Hinton, 2008). We denote each sample as $(a_i, b_i)$ for $i = 1, 2, \cdots, 100$, and the averaged direction of interest, $\mathbb{E}[E(G_s(s + s_i)) - E(G_s(s - s_i))]$, in two-dimensional space as $(a, b)$.

E EXPERIMENTAL DETAILS

We use unsupervised directions from SeFa and GANSpace, both of which are found in the intermediate space W; this limits the maximum number of directions per method to 512, the dimension of the intermediate latent space. We use 512 directions from GANSpace and 80 directions from SeFa. Since the earlier layers of StyleGAN mainly produce large structural changes while the deeper layers relate to small, detailed changes, we apply the resulting 592 (= 512 + 80) vectors to three groups of layers: coarse (resolution 4∼32), medium (resolution 64∼128) and fine (resolution 128∼) (Karras et al., 2020a), which amounts to a total of 1776 (= 592 × 3) unsupervised directions. Since the text-guided manipulation directions are found in StyleSpace S, we map the 1776 directions into StyleSpace S, where the maximum value of the style channel parameter in each direction is 15. The unsupervised directions are dense vectors whose elements are non-zero, so they affect every channel in StyleSpace. Dense manipulation vectors are prone to modifying multiple regions of an image, causing the image to lose its original identity. Sparsifying the known manipulation vectors α into $\alpha^{(k)}$, where k = 10, 30, 50, improves the disentanglement of the learned dictionary. We train the dictionary, initialized as a zero matrix, for 30000 epochs. The hyper-parameter λ is set to 0.01, and the learning rate is 3.0 with Adadelta (Zeiler, 2012), an optimization method with an adaptive learning rate.

F COMPARISON WITH OTHER SOTA METHODS

Multi2One and StyleCLIP GlobalDirection are the two methods that provide instant manipulation for a given text and image without additional computation for optimization. Local optimization (Patashnik et al., 2021) and StyleMC (Kocasari et al., 2022) are optimization-based methods that require about 100 seconds of inference time on a single NVIDIA 2080Ti.
Moreover, such optimization-based methods lack the disentanglement property of Multi2One and GlobalDirection, leading to significantly worse source-identity preservation. The text-image similarity is measured using $R_{CLIP}(t, X_{edited})$ from Eq. (7), which measures the similarity between the text t and the manipulated image. We used a total of 57 text descriptions of the human face (see Appendix A) to evaluate performance with the FFHQ-pretrained generator. We manipulate all 6048 channels in StyleSpace, in accordance with the optimization-based methods, which do not prune channels as a remedy for entanglement. Despite being an input-agnostic manipulation method, Multi2One is on par with the input-dependent StyleCLIP Local Optimization method (Patashnik et al., 2021) in the CLIP similarity score $R_{CLIP}(t, X_{edited})$. The ability to find the unsupervised directions $\alpha^{(100)}$ (only the top 100 channels of the unsupervised direction α are modified) is measured using the mean squared error in Eq. (6).

I COMPUTATIONAL OVERHEAD

StyleCLIP GlobalDirection and Multi2One require preprocessing time to compute the $\phi_{CLIP}(\cdot)$ operation. This operation is computed using either $e_i$ over n channels (StyleCLIP) or α over a total of q directions (Multi2One). More specifically, the StyleGAN-ADA generator has a fixed StyleSpace dimension of n channels, which varies with the resolution of the output image: the 1024 resolution has 6048 channels, while 512 has 5952 and 256 has 5760. The total computation time is therefore determined by the number of $\phi_{CLIP}(\cdot)$ operations calculated over the n channels or q directions, using 100 source images to compute the expectation. In Tab. 7 we measure the average computation time in seconds per iteration on an NVIDIA 2080Ti, which is equal for both methods. Since one iteration corresponds to a single $\phi_{CLIP}(\cdot)$ operation, comparing the number of iterations shows that our method requires less computation time.
Our method does introduce an additional dictionary learning phase, but it takes only about 15 minutes. We therefore conclude that our method is slightly more efficient in terms of the time needed to construct the dictionary D.
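The iteration-count argument above can be made concrete with a small back-of-the-envelope sketch. It assumes, following the text, that one iteration is one $\phi_{CLIP}(\cdot)$ operation per channel or direction and that both methods share the same per-iteration cost. The function names are hypothetical, and the per-iteration time used in the final lines is a placeholder for illustration only, not a value from Tab. 7.

```python
# StyleSpace channel counts n per output resolution, as stated in the text.
STYLESPACE_CHANNELS = {1024: 6048, 512: 5952, 256: 5760}

# q: (512 GANSpace + 80 SeFa) directions applied to 3 layer groups (Appendix E).
NUM_UNSUPERVISED_DIRECTIONS = (512 + 80) * 3

# "About 15 minutes" of dictionary learning, expressed in seconds.
DICTIONARY_LEARNING_SECONDS = 15 * 60


def preprocessing_seconds(num_iterations, seconds_per_iteration, extra_seconds=0.0):
    """Total preprocessing time: one phi_CLIP operation per iteration,
    plus any fixed extra cost (e.g. dictionary learning)."""
    return num_iterations * seconds_per_iteration + extra_seconds


def break_even_seconds_per_iteration(n, q, extra_seconds):
    """Per-iteration time above which q iterations plus the fixed extra cost
    are cheaper than n iterations."""
    if n <= q:
        raise ValueError("break-even is only defined when n > q")
    return extra_seconds / (n - q)


n = STYLESPACE_CHANNELS[1024]
q = NUM_UNSUPERVISED_DIRECTIONS
threshold = break_even_seconds_per_iteration(n, q, DICTIONARY_LEARNING_SECONDS)

# Hypothetical per-iteration time, chosen only to illustrate the comparison.
hypothetical_t = 1.0
single_channel_total = preprocessing_seconds(n, hypothetical_t)
multi_channel_total = preprocessing_seconds(q, hypothetical_t, DICTIONARY_LEARNING_SECONDS)
```

Under these assumptions, the dictionary-based pipeline is cheaper whenever the measured per-iteration time exceeds `threshold`; the actual per-iteration time should be taken from Tab. 7.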

J LIMITATIONS

The flexibility and diversity of text are not fully exploited due to the limited encoding ability and the deterministic representation of CLIP (Radford et al., 2021). For example, it is possible to observe a change in pose when using the eigenvectors from unsupervised methods (vectors that correspond to large eigenvalues often produce large changes in the image). However, since CLIP is known to have a limited understanding of relative position, it is impossible to modify an image using text guidance such as 'move right' and 'turn left'. Moreover, we observe that the CLIP encoder fails to represent some text instances, such as 'Smiling eyes' (Fig. 7), even though Fig. 1(a) shows that a change in the eyes occurs with the text 'Smile'. Since it is shown that the editing direction of 'smiling eyes' can be discovered when provided with appropriate text, we conjecture that the CLIP encoder fails to represent some text inputs faithfully, leading to unsuccessful manipulation results.



In order to distinguish it from global direction, which means finding input-agnostic directions, we express the method proposed in StyleCLIP in this way.

We performed the experiment here using the GlobalDirection method and test examples provided on the StyleCLIP official site: https://github.com/orpatashnik/StyleCLIP

A more detailed explanation on representing the samples as directions for an angle histogram is deferred to Appendix B due to space constraints.

This is based on the observation that the change in image is not significant for the directions with lower eigenvalues in SeFa.

The original StyleCLIP paper refers to the magnitude of change in StyleSpace as α and the disentanglement level as β, which we substitute by the number of channels that are changed.

The construction of 57 random texts for manipulation on the FFHQ-pretrained StyleGAN-ADA is described in Appendix A.



Figure 1: (a) Manipulation by the 70-th direction from GANSpace generates 'a man with wide smile'. GlobalDirection (GD), highlighted in red, fails to reproduce a similar result even when provided with various text guidance. (b) Manipulation results for randomly selected text, demonstrating that GD has insufficient manipulation ability. The same number of channels is manipulated in both methods.


of the introduction. Toward this we show two empirical findings, which correspond to Fig. 1(a) and Fig. 1(b), respectively.


Figure 3: (a) Histogram of $\mathrm{CLIP}(G_s(s + \hat{s})) - \mathrm{CLIP}(G_s(s - \hat{s}))$ using single-channel manipulation $e_i$ or multiple-channel manipulation $\alpha^{(50)}$. (b) Manipulation results using channel indices 4520, 4426. GlobalDirection (denoted as StyleCLIP GD) manipulated with text 'white hair'.

Figure 5: Manipulation results using FFHQ-pretrained StyleGAN2-ADA via text guidance on our method and StyleCLIP GlobalDirection. The samples are manipulated with increasing number of channels that are modified leading to larger changes.

Figure 6: Manipulation by the direction with the highest cosine similarity. We find the unsupervised direction with the highest similarity given the text 'Little mermaid' and 'Joker smile'. Then we manipulate the image using the top-k channels of the found direction.

The angles are derived as $\theta_i = \operatorname{arctan2}(b_i, a_i)$ for all samples, which range over $[-\pi, +\pi]$. The histogram of the 100 angles is plotted in Fig. 3(a), with the angle of the averaged difference in $\mathbb{R}^2$, $\operatorname{arctan2}(b, a)$, emphasized as a red line.
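The angle computation described here can be sketched in a few lines. The sketch assumes the two-dimensional points $(a_i, b_i)$ and the averaged direction $(a, b)$ have already been obtained (the t-SNE step is not shown), and it uses only the Python standard library. The function names and the number of bins are illustrative choices, not taken from the paper.

```python
import math


def direction_angles(points):
    """Angle theta_i = arctan2(b_i, a_i) for each 2-D sample (a_i, b_i).
    math.atan2 returns values in [-pi, +pi]."""
    return [math.atan2(b, a) for a, b in points]


def angle_histogram(angles, num_bins=36):
    """Count angles in num_bins equal-width bins covering [-pi, +pi].
    The bin count is an illustrative choice."""
    counts = [0] * num_bins
    width = 2 * math.pi / num_bins
    for theta in angles:
        # Clamp so that theta == +pi falls into the last bin.
        idx = min(int((theta + math.pi) / width), num_bins - 1)
        counts[idx] += 1
    return counts


# Toy 2-D samples standing in for t-SNE outputs (hypothetical values).
samples = [(1.0, 1.0), (-1.0, 1.0), (-1.0, -1.0), (1.0, -1.0)]
angles = direction_angles(samples)
counts = angle_histogram(angles, num_bins=4)

# Angle of an averaged direction (a, b), drawn as the red line in the plot;
# the values here are again placeholders.
a, b = 0.5, 0.5
mean_angle = math.atan2(b, a)
```

A tightly concentrated histogram around `mean_angle` corresponds to samples that agree on the manipulation direction, while a spread-out histogram corresponds to the sample-to-sample variation discussed for single-channel manipulation.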

Figure 10: Manipulation results showing the disentanglement of Multi2One when using 100, 500, 1000 channels, and without pruning channels. StyleCLIP GlobalDirection manifests image collapse when using over 500 channels while Multi2One shows robustness against entanglement.

Figure 11: Manipulation results using AFHQ cat and AFHQ dog pretrained StyleGAN2-ADA via text guidance on our method Multi2One and StyleCLIP GlobalDirection. The samples are manipulated with increasing number of channels that are modified leading to larger changes.

Figure 12: Manipulation results using LSUN car and LSUN church pretrained StyleGAN2-ADA via text guidance on our method Multi2One and StyleCLIP GlobalDirection. The samples are manipulated with increasing number of channels that are modified leading to larger changes.

Figure 13: Additional manipulation results on AFHQdog-pretrained StyleGAN-ADA. Equal configuration for both GlobalDirection (denoted as GD) and Multi2One.

Figure 16: Additional manipulation results on LSUN church-pretrained StyleGAN-ADA. Equal configuration for both GlobalDirection (denoted as GD) and Multi2One. An image, encoded into CLIP space with encoder E(•), is used as guidance instead of text.

Mean Squared Error (Eq. (6); ↓) measured in StyleSpace. We manipulate a total of 100 channels using both GlobalDirection and Ours. Note that α is $\alpha^{(100)}$ in this metric.


Evaluation on other text-guided image manipulation methods. In Tab. 4, we provide additional results from other image manipulation methods: StyleCLIP GD (GlobalDirection from Patashnik et al. (2021)), StyleCLIP LO (local optimization from Patashnik et al. (2021)), and StyleMC from Kocasari et al. (2022).

Measured time required for a forward pass with batch size 1, in seconds, on an NVIDIA 2080Ti. SG2 stands for StyleGAN-ADA. The first row shows the duration of a single iteration in seconds.

ACKNOWLEDGEMENTS

This work was supported by Institute of Information & communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (MSIT) (No. 2019-0-00075, Artificial Intelligence Graduate School Program (KAIST)). This project is also supported by KAIST-NAVER Hypercreative AI Center.

c Q w a w A D h G V h z X l X p x P R m n O y m W P A + f z B V j K I = < / l a t e x i t > + Source Image < l a t e x i t s h a 1 _ b a s e 6 4 = " + I R D 9 X X B W H A O 8 s I N W X T y F k Jr x y i t r W 9 s b p W 3 K z u 7 e / s H 5 u F R T y a p w K S L E 5 a I e w 9 J w m h M u o o q R u 6 5 I C j y G O l 7 4 + v c 7 z 8 S I W k S 3 6 k J J 2 6 E R j E N K E Z K S 0 P z 3 P E S 5 s t J p D / o 8 J D W F g T E e I g e s p p d n 9 a H Z t V q W D P< l a t e x i t s h a 1 _ b a s e 6 4 = " 5 5 t t E + u u 2 v y 0 3 j d c y X t s y r + 1 E 6r x y i t r W 9 s b p W 3 K z u 7 e / s H 5 u F R T y a p w K S L E 5 a I e w 9 J w m h M u o o q R u 6 5 I C j y G O l 7 4 + v c 7 z 8 S I W k S 3 6 k J J 2 6 E R j E N K E Z K S 0 P z 3 P E S 5 s t J p D / o 8 J D W F g T E e I g e s l q z P q 0 P z a r V s G a A q 8 Q u< l a t e x i t s h a 1 _ b a s e 6 4 = " t K X l p z 8 + q 2 8 G o 1 j yr C e r Z d l a 8 5 a z R z D N 1 i v H y r S j S Y = < / l a t e x i t > p < l a t e x i t s h a 1 _ b a s e 6 4 = " F T S u Y 5 7r C e r Z d l a 8 5 a z R z D N 1 i v H y r S j S Y = < / l a t e x i t > p < l a t e x i t s h a 1 _ b a s e 6 4 = " F T S u Y 5 7r C e r Z d l a 8 5 a z R z D N 1 i v H y r S j S Y = < / l a t e x i t > p < l a t e x i t s h a 1 _ b a s e 6 4 = " c 2 q E 4 W 0 / f o 0 n r s i 9 n j t p N k N e N e U = " > A A A B 6 n i c b V B N S 8 N A E J 3 U r 1 q / q h 6 9 L B b B U 0 l K U Y 8 F L x 4 r 2 g 9 o Q 9 l s N u 3 S z S b s T o R S + h O 8 e F D E q 7 / I m / / G b Z u D t j 4 Y e L w 3 w 8 y 8 I J X C o O t + O 4 W N z a 3 t n e J u a W / / 4 PN w Y T w + S n 5 n 7 T L t l u z q 8 1 q q V 5 e x Z G H E z i F c 3 D h A u p w D Q 1 o A Q O E e 3 i E J + v W e r C e r Z d l a 8 5 a z R z D N 1 i v H y r S j S Y = < / l a t e x i t > p < l a t e x i t s h a 1 _ b a s e 6 4 = " i F 𝑠 " 𝑠 # 𝑠 ! 𝑠 ! < l a t e x i t s h a 1 _ b a s e 6 4 = " X 7 7 t D T i F g 4 R e / z 8 B a S O h X Z N x 5 y 8 = " > A R H q h 5 L x f / 8 7 q p j q 7 8 j𝑠 " 𝑠 # 𝑠 ! 
< l a t e x i t s h a 1 _ b a s e 6 4 = " 5 5 t t E + u u 2 v y 0 3 j d c y X t s y r + 1 E 6 g = " > AR q S z r x y i t r W 9 s b p W 3 K z u 7 e / s H 5 u F R T y a p w K S L E 5 a I e w 9 J w m h M u o o q R u 6 5 I C j y G O l 7 4 + v c 7 z 8 S I W k S 3 6 k J J 2 6 E R j E N K E Z K S 0 P z 3 P E S 5 s t J p D / o 8 J D W F g T E e I g e s l q z P q 0 P z a r V s G a A q 8 Q u< l a t e x i t s h a 1 _ b a s e 6 4 = " +R q S z r x y i t r W 9 s b p W 3 K z u 7 e / s H 5 u F R T y a p w K S L E 5 a I e w 9 J w m h M u o o q R u 6 5 I C j y G O l 7 4 + v c 7 z 8 S I W k S 3 6 k J J 2 6 E R j E N K E Z K S 0 P z 3 P E S 5 s t J p D / o 8 J D W F g T E e I g e s p p d n 9 a H Z t V q W D Pe e j V f j w / i c l 5 a M o u c Y L M D 4 + g V 6 0 J 3 W < / l a t e x i t > (↵ (1) ) < l a t e x i t s h a 1 _ b a s e 6 4 = " t K X l p z 8 + q 2 8 G o 1 j y R H q h 5 L x f / 8 7 q p j q 7 8 jy C Z / A K 3 q w n 6 8 V 6 t z 6 m p Q V r 1 r M P / s D 6 / A H v e Z Z 9 < / l a t e x i t > ↵ (3) 𝑠 " 𝑠 # 𝑠 ! < l a t e x i t s h a 1 _ b a s e 6 4 = " u A x w 6 6 y X 3 h On the other hand, it is observable on the right side of Fig. 3 (a), that manipulation by unsupervised direction α (50) , shows much consistent results.Moreover, in Fig. 3 (b), we show an example where single channel alone has completely different meaning from the manipulation by the combination of multiple channels. This supports our claim that the interaction between multiple channels should be considered, rather than individually encoding single channel each. In Fig. 3 (b), manipulation by 4520-th or 4426-th do not show "completely white hair" while the combination of the two shows white hair. Since GlobalDirection fails to learn that 4520-th also has a role for creating white hair, it fails to successfully modify an image when text guidance is given as 'white hair'. 

[Figure panels: Source, Ours, GD]

C EXAMPLES OF ENTANGLEMENT USING UNSUPERVISED METHODS

In Fig. 8, we demonstrate the effect of using a limited number of channels among all non-zero channels in an unsupervised direction α ∈ R^n. Since the StyleGAN generator follows a ProgressiveGAN-style structure in which the resolution of the generated image progressively increases, the generator can be divided into three blocks: coarse, medium, and fine. We apply the unsupervised directions to each of the blocks. Moreover, when the direction is filtered to use only k manipulated channels, the source image changes less drastically. When all channels are manipulated, as shown at the rightmost side of the figure, the source image changes drastically, showing entanglement.

Fig. 10 shows the manipulation results of Multi2One and StyleCLIP GlobalDirection using 100, 500, and 1000 channels. In addition, the manipulation results without the channel pruning strategy are shown. More than 500 channels are required for a large structural change of the object in the source image. However, StyleCLIP GlobalDirection shows signs of severe image collapse before it creates an image that fully satisfies the given text guidance. For example, the image in the first row with 100 channels manipulated does not seem 'young' enough, and the extent of manipulation does not improve with more channels because image collapse and entanglement become the dominating factors. On the other hand, Multi2One produces consistently reasonable and disentangled images even when using all channels in StyleSpace.

Published as a conference paper at ICLR 2023

[…] an ϕ_CLIP(·) operation which maps the manipulation effect of α^(100), it is desirable to find a direction that modifies only the 100 channels that are non-zero among the n channels of α^(100), while leaving the other n − 100 channels unmodified. Therefore, MSE(α, α̂)|_{=0} measures the level of disentanglement as the distance between α̂ and α^(100) over these n − 100 channels. The result in Tab. 4 shows that ours, Multi2One, scores the lowest MSE(α, α̂)|_{=0}, indicating that it is the most disentangled among the four text-guided manipulation methods.
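
The channel filtering and the zero-channel MSE described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, the rule of keeping the k largest-magnitude channels is an assumption (the text does not state how the k channels are chosen), and the dimension n used in the demo is arbitrary.

```python
import numpy as np

def keep_top_k_channels(alpha, k):
    """Zero out all but the k largest-magnitude entries of a direction.

    Assumption: channels are ranked by absolute value; the source text
    does not specify the selection rule.
    """
    alpha = np.asarray(alpha, dtype=float)
    if k >= alpha.size:
        return alpha.copy()
    pruned = np.zeros_like(alpha)
    if k <= 0:
        return pruned
    # Indices of the k entries with the largest absolute value.
    top = np.argpartition(np.abs(alpha), -k)[-k:]
    pruned[top] = alpha[top]
    return pruned

def mse_on_zero_channels(alpha_ref, alpha_hat):
    """MSE restricted to the channels where the reference direction is zero.

    A lower value means the estimated direction leaves more of the
    untouched channels untouched.
    """
    alpha_ref = np.asarray(alpha_ref, dtype=float)
    alpha_hat = np.asarray(alpha_hat, dtype=float)
    zero_mask = alpha_ref == 0
    if not zero_mask.any():
        return 0.0
    diff = alpha_hat[zero_mask] - alpha_ref[zero_mask]
    return float(np.mean(diff ** 2))

# Demo with an arbitrary dimension (n = 512 is illustrative only).
rng = np.random.default_rng(0)
alpha = rng.normal(size=512)
alpha_100 = keep_top_k_channels(alpha, 100)         # 100 non-zero channels
noisy_estimate = alpha_100 + 0.01 * rng.normal(size=512)
leak = mse_on_zero_channels(alpha_100, noisy_estimate)
```

Here `leak` is the mean squared change on the 412 channels that the pruned direction leaves at zero, so an estimate that touches only the intended 100 channels would give a value of 0.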

G ABLATION STUDY ON LOSS TERM

We report the similarity score as R_CLIP(t, X_edited) − R_CLIP(t, X) to show the increase in similarity when an image is modified using the text guidance t. Note that the ID loss tends to increase as the image is manipulated properly; therefore, it is not a robust metric for evaluating the entanglement level. In Fig. 17, we visualize the manipulation results with the second loss term removed (λ = 0) and with it included (λ = 0.01). We observe that the additional loss term helps find novel directions that are not present in the unsupervised directions, leading to better manipulative ability.
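
The similarity gain reported above can be sketched in a few lines. This is a hedged illustration only: it assumes that R_CLIP is the cosine similarity between a CLIP text embedding and a CLIP image embedding, and that the embeddings have already been computed elsewhere. The function names are hypothetical and no CLIP library is called here.

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine similarity between two 1-D embedding vectors."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    if denom == 0:
        raise ValueError("embeddings must be non-zero")
    return float(u @ v / denom)

def clip_similarity_gain(text_emb, source_emb, edited_emb):
    """Similarity of the edited image to the text minus that of the source.

    Assumption: R_CLIP is cosine similarity in CLIP space. A positive
    value means the edit moved the image toward the text guidance.
    """
    return (cosine_similarity(text_emb, edited_emb)
            - cosine_similarity(text_emb, source_emb))
```

Subtracting the source similarity removes the baseline agreement that an image already has with the text before any edit, so images that happen to match the prompt from the start do not inflate the score.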

[Figure labels: Young Teen / Eyeglasses Pink Hair Little Mermaid Male Green Hair / Source]

H ABLATION STUDY ON THE EFFECT OF UNSUPERVISED METHODS

The dictionary learning process of Multi2One employs the directions α ∈ R^n from unsupervised methods (Shen & Zhou, 2021; Härkönen et al., 2020). Therefore, we conduct an ablation study on the effect of using unsupervised directions by comparing two cases: directions α that come from a supervised method (Shen et al., 2020) and directions that come from unsupervised methods (Shen & Zhou, 2021; Härkönen et al., 2020). Fig. 18 shows that directions from the supervised method are less capable of producing the desired change in an image. This is also reflected in the quantitative measure R_CLIP(t, X_edited) in Tab. 6, where the similarity score between the manipulated image and the text guidance is significantly lower with directions from the supervised method than with directions from the unsupervised methods. This is due to the limited number of directions that can be found by the supervised method, which amounts to only 14 using InterFaceGAN (Shen et al., 2020).

[…] 2021). We substitute the ground-truth directions in the dictionary learning process with directions from InterFaceGAN (Shen et al., 2020). We manipulate the same number of channels with equal manipulation strength for both.

