CLIP-PAE: PROJECTION-AUGMENTATION EMBEDDING TO EXTRACT RELEVANT FEATURES FOR A DISENTANGLED, INTERPRETABLE, AND CONTROLLABLE TEXT-GUIDED IMAGE MANIPULATION

Abstract

Recently introduced Contrastive Language-Image Pre-Training (CLIP) (Radford et al., 2021) bridges images and text by embedding them into a joint latent space. This has opened the door to a growing body of work that aims to manipulate an input image by providing a textual description. However, due to the discrepancy between image and text embeddings in the joint space, using text embeddings as the optimization target often introduces undesired artifacts in the resulting images. Disentanglement, interpretability, and controllability are also hard to guarantee during manipulation. To alleviate these problems, we propose to define corpus subspaces spanned by relevant prompts to capture specific image characteristics. We introduce the CLIP projection-augmentation embedding (PAE) as an optimization target to improve the performance of text-guided image manipulation. Our method is a simple and general paradigm: it can be easily computed and adapted, and smoothly incorporated into any CLIP-based image manipulation algorithm. To demonstrate its effectiveness, we conduct several theoretical and empirical studies. As a case study, we utilize the method for text-guided semantic face editing. We quantitatively and qualitatively demonstrate that PAE facilitates more disentangled, interpretable, and controllable image manipulation with state-of-the-art quality and accuracy.

1. INTRODUCTION

In text-guided image manipulation, the system receives an image and a text prompt and is tasked with editing the image according to the text prompt. Such tasks have received considerable research attention due to the great expressive power of natural language. The recently introduced and increasingly popular Contrastive Language-Image Pre-Training (CLIP) (Radford et al., 2021) achieves this by embedding images and texts into a joint latent space. Combined with generative techniques such as Generative Adversarial Networks (GANs) (Goodfellow et al., 2014) and diffusion models (Ho et al., 2020; Dhariwal & Nichol, 2021), CLIP has been utilized to develop several high-quality image manipulation methods (e.g., Ling et al., 2021; Zhang et al., 2021; Antal & Bodó, 2021; Khodadadeh et al., 2022), where the image is optimized to be similar to the text prompt in the CLIP joint space. There are three important but difficult-to-satisfy properties in image manipulation: disentanglement, interpretability, and controllability. Disentanglement means that the manipulation should only change the features referred to by the text prompt and should not affect other, irrelevant attributes (Wu et al., 2021; Xu et al., 2022). Interpretability means that we know why/how an edit to the latent code affects the output image and thus understand the reasoning behind each model decision (Doshi-Velez & Kim, 2018; Miller, 2019), or that the model can extract relevant information from the given data (Murdoch et al., 2019). Finally, controllability is the ability to control the intensity of the edit (You et al., 2021; Park et al., 2020; Li et al., 2019) for individual factors and is hence tightly related to disentanglement.
In CLIP-based text-guided image manipulation methods, both the latent space of the generative network and the embedding space of CLIP extensively compress information (e.g., a 1024 × 1024 RGB image has 3 × 1024^2 dimensions, whereas a typical StyleGAN (Karras et al., 2019) and CLIP both only have 512-dimensional latent spaces), so manipulation in the latent space is akin to a black box. Gabbay et al. (2021) argue that a GAN fails at disentanglement because it only focuses on localized features. Zindancıoglu & Sezgin (2021) also show that several action units in StyleGAN (Karras et al., 2019) are correlated even when they are responsible for semantically distinct attributes in the image. In addition, we found that although CLIP embeddings of images and texts share the same space, they actually reside far away from each other (see Section 3.1), which can lead to undesired artifacts in the generated images such as unintended edits or distortion of facial identity (see Figure 1). Finally, most existing methods for CLIP-based text-guided image manipulation do not allow free control over the magnitude of the edit. As a result, how to perform image manipulation in a disentangled, interpretable, and controllable way with the help of text models remains a hard and open problem. In this paper, we introduce a technique that can be applied to any CLIP-based text-guided image manipulation method to yield more disentangled, interpretable, and controllable performance. Rather than optimizing an image directly towards the text prompt in the CLIP joint latent space, we propose a novel type of CLIP embedding, the projection-augmentation embedding (PAE), as the optimization target. PAE was motivated by two empirical observations. First, images do not overlap with their corresponding texts in the joint space. This means that a text embedding does not represent the embedding of the true target image that should be optimized for.
Therefore, directly optimizing an image to be similar to the text in the CLIP space results in undesirable artifacts or changes in irrelevant attributes. Second, a CLIP subspace constructed via embeddings of a set of relevant texts can constrain the changes of an image with only relevant attributes. To construct a PAE, we first project the embedding of the input image onto a corpus subspace constructed by relevant texts describing the attributes we aim to disentangle, and record a residual vector. Next, we augment this projected embedding in the subspace with the target text prompt, allowing for a user-specified augmenting power to control the intensity of change. Finally, we add back the residual to reconstruct a vector close to the "image region" of the joint space. We demonstrate that the PAE is a better approximation to the embedding of the true target image. With PAE, we achieve better interpretability via an explicit construction of the corpus subspace. We achieve better disentanglement since the subspace constrains the changes of the original image with only relevant attributes. We achieve free control of the magnitude of the edit via a user-specified augmenting power. PAE is easy to use, quick to pre-compute, and can be incorporated into any CLIP-based latent manipulation algorithms to improve performance. In short, we highlight the three major contributions of this paper: 1. (Section 3) We perform a series of empirical analyses of the CLIP space and its subspaces, identifying i) the limitations of using a naive CLIP loss for text-guided image editing; and ii) several unique properties of the CLIP subspace. 2. (Section 4) Based on our findings in Section 3, we propose the projection-augmentation embedding (PAE) as a better approximation to the embedding of the true target image.

3. (Section 5)

We demonstrate that employing PAE as an alternative optimization target facilitates a more disentangled, interpretable, and controllable text-guided image manipulation. This is validated through several text-guided semantic face editing experiments where we integrate PAE into a set of state-of-the-art models. We quantitatively and qualitatively demonstrate that PAE boosts the performance of all chosen models with high quality and accuracy.

2. RELATED WORK

Latent Manipulation for Image Editing One popular approach to image manipulation is based on the latent code: the input image is first embedded into the latent space of a pre-trained generative network such as a GAN (Goodfellow et al., 2014); then, to modify the image, one updates either the latent code of the image (e.g., Zhu et al., 2016; Ling et al., 2021; Zhang et al., 2021; Antal & Bodó, 2021; Khodadadeh et al., 2022; Creswell & Bharath, 2018; Perarnau et al., 2016; Pidhorskyi et al., 2020; Hou et al., 2022b; Xia et al., 2021; Kocasari et al., 2022; Patashnik et al., 2021; Shen et al., 2020) or the weights of the network (e.g., Cherepkov et al., 2021; Nitzan et al., 2022; Reddy et al., 2021) to obtain the desired edit. However, these methods can only alter a set of pre-defined attributes and thus lack flexibility and generalizability. CLIP for Text-Guided Image Manipulation In 2021, Radford et al. proposed Contrastive Language-Image Pre-Training (CLIP), where an image encoder and a text encoder are trained such that semantically similar images and texts are also similar in the joint embedding space. The insight of connecting image and text in the same space enables a wide spectrum of computer vision applications. For example, Radford et al. provide applications for image captioning, image class prediction, and zero-shot transfer. More sophisticated tasks include language-driven image generation (Ramesh et al., 2022), zero-shot semantic segmentation (Li et al., 2022), image emotion classification (IEC) (Deng et al., 2022), large-scale detection of specific content (González-Pizarro & Zannettou, 2022), object proposal generation (Shi et al., 2022b), object sketching (Vinker et al., 2022), and referring expression grounding (Shi et al., 2022a). CLIP has also been applied to text-guided image manipulation tasks.
In this domain, one approach is to edit the latent code of a generative network so that the embedding of the generated image is similar to the embedding of the given text in the CLIP space (Kocasari et al., 2022; Patashnik et al., 2021; Xia et al., 2021; Hou et al., 2022a; Ramesh et al., 2022) (see Figure 4a). However, this straightforward approach sometimes fails to change the desired attributes, or fails to change them in a disentangled way: other unrelated features are also affected (see the comparative study in Sections 5.2 and 5.3). In addition, the proposed methods often fail to exhibit enough interpretability and controllability. This is probably attributable to the separation of images and texts in the CLIP embedding space, so that optimizing an image embedding towards a text embedding naturally introduces some undesired effects. There are some remedies, such as penalizing feature changes during editing (e.g., Canfes et al., 2022; Kocasari et al., 2022), separating the image into different granularity levels (Patashnik et al., 2021), formulating a constrained optimization problem (Zhu et al., 2016), partially labeling a set of features (Gabbay et al., 2021), or updating the parameters of the underlying GAN to preserve features (e.g., Nitzan et al., 2022; Reddy et al., 2021; Cherepkov et al., 2021), but they do not collectively achieve a disentangled, interpretable, and controllable manipulation. Additionally, some of them target all features indiscriminately: they still do not separate the related features from the irrelevant ones. Moreover, most of these techniques only work for specific methods or attributes and lack generalizability, and some are too complicated and time-consuming to be used in practical large-scale applications.

3. ANALYSIS OF THE CLIP JOINT SPACE AND SUBSPACE

Most CLIP-based text-guided image manipulation algorithms (e.g., Kocasari et al., 2022; Patashnik et al., 2021; Xia et al., 2021; Hou et al., 2022a; Ramesh et al., 2022) follow a general paradigm where certain parameters of an image editing process (such as the latent code or the weights of a generative network) are trained to minimize a cosine similarity loss between the resulting image and a text prompt in the CLIP joint space (see Figure 4a ). However, naively using this loss may introduce undesirable artifacts or unintended changes unrelated to the text prompt, as shown in Figure 1 . We hypothesize that this is due to the discrepancy between the images and the texts in the CLIP joint space, as we will demonstrate in Section 3.1. Therefore, the embedding of the text prompt in the CLIP joint space does not actually represent the embedding of the true target image that should be optimized for. To alleviate this issue, we construct CLIP subspaces with desirable properties (see Section 3.2) leading to our proposed projection-augmentation embedding (see Section 4).

3.1. NON-OVERLAPPING IMAGE AND TEXT EMBEDDINGS

As an empirical demonstration, we collect over 1500 face images and 1500 textual descriptions of faces (e.g., emotion, hairstyle, or general descriptions) and visualize their embeddings in the CLIP joint space using PCA (Jolliffe, 2002) in Figure 2. Note that the visualization uses Euclidean distance; although CLIP is trained with cosine similarity, if we normalize all the embeddings, the Euclidean distance and the cosine distance give exactly the same ranking because

||a - b||^2 = (a - b)^T (a - b) = ||a||^2 + ||b||^2 - 2 a^T b = 2 - 2 CosSim(a, b). (1)

As shown in Figure 2, the image and the text embeddings lie in two non-overlapping regions. They also exhibit a lower inter-modality cosine similarity compared to the intra-modality similarity, regardless of their semantic meanings. For example, the cosine similarity between an image of a dog and the text "dog" is 0.253, and that between an image of a cat and the text "cat" is 0.275. On the other hand, the similarity between a dog image and a cat image reaches 0.841, and that between the two texts "dog" and "cat" is as high as 0.936, much higher than the inter-modality similarities. We record additional evaluations of inter-modality and intra-modality cosine similarities in Appendix A.1. In the general paradigm described above (see Figure 4a), the CLIP embedding of the ideal target image is essentially approximated by the embedding of the text description. However, the separation of text and image embeddings clearly invalidates this approximation and thus leads to artifacts (see Figure 1 and more in Sections 5.2 and 5.3).
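Eq. (1) can be checked numerically in a few lines; the following numpy sketch uses random unit vectors as stand-ins for normalized CLIP embeddings (the vectors and dimensionality are illustrative only):

```python
import numpy as np

rng = np.random.default_rng(0)

# Random 512-d stand-ins for CLIP embeddings, normalized to unit length.
a = rng.normal(size=512)
b = rng.normal(size=512)
a /= np.linalg.norm(a)
b /= np.linalg.norm(b)

cos_sim = a @ b
sq_dist = np.sum((a - b) ** 2)

# For unit vectors, ||a - b||^2 = 2 - 2 CosSim(a, b), so ranking by squared
# Euclidean distance and ranking by cosine similarity coincide (reversed).
print(sq_dist, 2 - 2 * cos_sim)
```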

3.2. SUBSPACES DISTILLING RELEVANT INFORMATION FROM IMAGE EMBEDDINGS

Since the joint space is a vector space, we can construct a subspace of it using a set of relevant text prompts as basis vectors. For example, we can construct an "emotion subspace" using relevant emotion texts. In this section, we explain how such a subspace distills relevant information from images, which can be used to constrain the changes of an image. We also include experiments in Appendix A.5 showing that the subspace can likewise distill information from texts. We leave the mathematical details of the formulation of the subspace to Section 4, but preview certain properties of the subspace here, as these properties inspire the formulation of our method. We invited ten participants to record five-second videos of their faces changing from neutral to laughing out loud (LOL). We compute the cosine similarity (averaged over all videos) of each frame to the first frame or to the text "a happy face", either in the full CLIP space or after projecting the embeddings onto the emotion or hairstyle subspace. The results are shown in Figure 3. In Figure 3a, the similarity decreases much faster in the emotion subspace than in the CLIP space and the hairstyle subspace. A similar pattern shows up in Figure 3b: the similarity increases much faster in the emotion subspace. Since the subspace has a lower dimensionality than the original space, these observations indicate that information regarding the irrelevant attributes is discarded during the projection. Hence changes of hairstyle in the facial images have little influence on the similarity: the emotion subspace only distills emotional attributes from the image and discards the others. This inspires the formulation of our method: if we make changes to the embedding of the original image within a subspace, it should only induce changes of attributes related to the subspace. As a result, these changes are disentangled.
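The distilling effect can be illustrated with a toy computation: perturb an embedding only in directions orthogonal to a subspace (a stand-in for, say, a hairstyle change that leaves emotion untouched) and compare similarities before and after projection. The 6-dimensional subspace here is random and purely hypothetical:

```python
import numpy as np

rng = np.random.default_rng(1)
d = 512

# A random orthonormal basis standing in for an "emotion subspace".
basis, _ = np.linalg.qr(rng.normal(size=(d, 6)))

def project(e, B):
    """Project embedding e onto the subspace spanned by the columns of B."""
    return B @ (B.T @ e)

def cos_sim(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

e = rng.normal(size=d)
# A change confined to directions orthogonal to the subspace.
noise = rng.normal(size=d)
noise -= project(noise, basis)
e_changed = e + 0.5 * noise

# Full-space similarity drops, but the subspace projection is unchanged:
# the subspace has discarded the irrelevant attribute.
full = cos_sim(e, e_changed)
sub = cos_sim(project(e, basis), project(e_changed, basis))
```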

4. PROJECTION-AUGMENTATION EMBEDDING

Motivated by our findings in Section 3, we propose the projection-augmentation embedding (PAE) as a better approximation to the embedding of the true target image. There are three objectives we aim to fulfill when constructing such an embedding: i) it should be closer to the image region than the text is; ii) it should be guided by the target text prompt; iii) the guidance should be provided within the subspace so the changes are disentangled.

4.1. OVERVIEW

Given an input image I and a text prompt T, we construct the PAE, denoted E_W(I, T, α), as follows. First, we obtain the embeddings of I and T, denoted e_I and e_T, via the CLIP image and text encoders (Radford et al., 2021). We then project e_I onto a corpus subspace W constructed from texts describing the attributes we aim to disentangle: w = P_W(e_I), (2) where P_W is the projection operation onto the subspace W. We explain the details of P_W and the construction of W in Section 4.2. Next, we record a residual vector r of the projection: r = e_I - w. (3) After that, we augment the influence of the text prompt T on the projected embedding w and, finally, add back the residual vector r to construct the final embedding: E_W(I, T, α) = A_{W,T}(w, α) + r, (4) where A_{W,T} is the augmentation operation in W according to T with a controllable augmenting power α, explained in detail in Section 4.3. A graphical illustration of PAE can be found in Figure 4c. In response to the three aforementioned objectives, we highlight that we add back the residual to ensure that the PAE is close to the image region rather than the text region, that we apply the augmentation to ensure that the PAE is guided by the target text prompt, and that we apply the projection to ensure that the guidance is provided via the subspace so that the changes to the image are disentangled (recall Section 3.2). As we will see in Section 4.2, the construction of the subspace with explicit selections of relevant prompts makes our approach more interpretable than black-box latent manipulations. Our introduction of the augmenting power (see Section 4.3) makes it possible to freely control the magnitude of the change to the image. PAE can be integrated into any CLIP-based text-guided image manipulation algorithm in place of the text embedding in the final loss function, as shown in Figure 4b.
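The three steps (project, augment, add back the residual) can be sketched in a few lines of numpy; the orthonormal basis below is a random placeholder for a real corpus subspace, and `simple_augment` is only the simplest augmentation option discussed in Section 4.3:

```python
import numpy as np

def pae(e_img, e_txt, basis, augment, alpha):
    """Sketch of E_W(I, T, alpha): project, augment, add back the residual.

    basis: (d, N) matrix with orthonormal columns spanning the corpus
    subspace W; augment: an augmentation operation A_{W,T}.
    """
    w = basis @ (basis.T @ e_img)        # w = P_W(e_I)
    r = e_img - w                        # residual, orthogonal to W
    return augment(w, e_txt, basis, alpha) + r

def simple_augment(w, e_txt, basis, alpha):
    # The simplest option: A_{W,T}(w, alpha) = w + alpha * e_T.
    return w + alpha * e_txt

rng = np.random.default_rng(2)
B, _ = np.linalg.qr(rng.normal(size=(512, 6)))  # placeholder subspace
e_i, e_t = rng.normal(size=512), rng.normal(size=512)
target = pae(e_i, e_t, B, simple_augment, alpha=2.0)
```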
In Figure 2 we also include the corresponding PAEs generated from the same face images and facial descriptions in Section 3.1. We see that, compared to the text embeddings, the PAEs are indeed closer to the image embeddings, and that the PAE distribution has a lower variation, suggesting that they contain more specific information. In the following subsections, we introduce the details of the projection and augmentation operations.

[Figure 4a caption: A general paradigm followed by most CLIP-based text-guided image manipulation algorithms for semantic face editing (e.g., Patashnik et al., 2021; Kocasari et al., 2022; Xia et al., 2021).]

4.2. PROJECTION

We introduce two options for the projection operation: P_W^GS and P_W^PCA. P_W^GS aims to find semantic basis vectors for W, since the subspace is a semantically meaningful structure. For example, if we aim to change/disentangle facial emotions, we can use the text embeddings of a set of basic emotions as the basis vectors. After selecting the basis vectors {b_n}_n, we apply the Gram-Schmidt process to obtain an orthonormal basis {b̂_n}_n, and then project e_I onto W by taking dot products with the basis vectors: P_W^GS(e_I) := Σ_k (b̂_k^T e_I) b̂_k. (5) P_W^GS will fail if there is no apparent semantic basis (e.g., for hairstyle, it is hard to find "basic hairstyles"). P_W^PCA instead finds a basis for W by collecting a set T consisting of a corpus of relevant text embeddings and performing principal component analysis (PCA) (Jolliffe, 2002) to extract a pre-defined number N of principal components as the basis {b_n}_{n=1}^N. Effectively, the space spanned by {b_n}_{n=1}^N approximates a subspace W that encompasses all the related texts in the corpus of interest. Once we have the basis, the projection is performed as in Eq. (5). Other dimensionality reduction techniques, such as kernel PCA (Schölkopf et al., 1997) or t-SNE (Van der Maaten & Hinton, 2008), can also be used in place of PCA. We also experimented with a simpler idea: projecting both e_I and e_T onto W and performing the optimization in W, instead of augmenting in W and adding back the residual. However, this approach introduced significant artifacts and entangled changes, possibly due to the loss of information during the projection (see Appendix A.4 for more details).
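A compact numpy sketch of both options: `gram_schmidt` builds the orthonormal basis for P_W^GS, and `pca_basis` extracts principal directions from a text corpus via SVD (equivalent to PCA on the centered embeddings). Names and shapes are our own conventions, not from the paper's code:

```python
import numpy as np

def gram_schmidt(vectors):
    """Orthonormalize semantic basis vectors {b_n} for P_W^GS."""
    basis = []
    for v in vectors:
        w = v - sum(((v @ u) * u for u in basis), np.zeros_like(v))
        basis.append(w / np.linalg.norm(w))
    return np.stack(basis, axis=1)               # (d, N) orthonormal columns

def pca_basis(text_embeddings, n_components):
    """Basis for P_W^PCA: top principal directions of an (M, d) corpus."""
    X = text_embeddings - text_embeddings.mean(axis=0)
    _, _, vt = np.linalg.svd(X, full_matrices=False)
    return vt[:n_components].T                   # (d, N)

def project(e, B):
    """P_W(e): sum of dot products with the orthonormal basis vectors."""
    return B @ (B.T @ e)
```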

4.3. AUGMENTATION

The augmentation operation directs the embedding of the original image towards the target. As discussed above, the augmentation of the projected image w should be guided by the target text prompt.

A simple augmentation operation can be

A_{W,T}(w, α) = w + α e_T. (6) However, in our pilot studies, we found that this simple operation results in a PAE too similar to the original embedding, leading to an optimization process that barely does anything (see Appendix A.3). We strengthen the influence of e_T on w by weakening the unintended attributes of w (the components of w where P_W(e_T) has a low value) while preserving the sum of coefficients of w. Mathematically, we first calculate the coefficients c_k of w and the coefficients d_k of the projected text P_W(e_T), expressed under the basis {b_n}_{n=1}^N of W established in the previous section:

c_k = w^T b_k; (7)
d_k = P_W(e_T)^T b_k. (8)

Next, we weaken (via subtraction) all components of w and then add back the projected text P_W(e_T), with a coefficient chosen such that the sum of coefficients is preserved and the embedding does not deviate too much:

A_{W,T}(w, α) := Σ_{k=1}^N (c_k - α|c_k|) b_k + (α Σ_{k=1}^N |c_k| / Σ_{k=1}^N d_k) P_W(e_T), (9)

where α ∈ R^+ controls the augmenting power, contributing to a controllable latent manipulation, as demonstrated in Figure 5b. The two terms in Eq. (9) together weaken the components that are small in the projected text. We also include two more options for projection and three more for augmentation in Appendix A.2, including the case where W is not a linear subspace but a general manifold (so that there is no notion of a basis). Future researchers can develop further options suited to their specific tasks; this is where the extensibility of PAE lies. Preliminary criteria for evaluating and selecting among different PAEs are presented in Appendix A.3.
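The coefficient-preserving augmentation can be written directly from the formulas above; this is a sketch under our reading of Eq. (9), and it assumes the sum of the projected-text coefficients d_k is nonzero:

```python
import numpy as np

def augment_plus(w, e_txt, basis, alpha):
    """A^+_{W,T}: weaken every component of w, then add the projected text
    back with a coefficient chosen so the sum of subspace coefficients of
    the result equals that of w."""
    c = basis.T @ w                          # coefficients c_k of w
    p_txt = basis @ (basis.T @ e_txt)        # projected text P_W(e_T)
    d = basis.T @ p_txt                      # coefficients d_k
    weakened = basis @ (c - alpha * np.abs(c))
    # Assumes d.sum() != 0; a degenerate corpus would need special handling.
    return weakened + (alpha * np.abs(c).sum() / d.sum()) * p_txt

rng = np.random.default_rng(3)
B, _ = np.linalg.qr(rng.normal(size=(512, 6)))
w = B @ (B.T @ rng.normal(size=512))         # a vector already inside W
e_t = rng.normal(size=512)
out = augment_plus(w, e_t, B, alpha=7.0)
```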

5. EXPERIMENT

As a case study, we utilize PAE in a series of text-guided semantic face editing experiments, where we manipulate a high-level (emotion) and a low-level (hairstyle) facial attribute (Patashnik et al., 2021; Lyons et al., 2000). We also show results of manipulating other facial attributes, such as the size of the eyes and the size of the mouth, in Appendix A.6, and results of manipulating non-facial images, such as dogs, in Appendix A.7.

5.1. DATA AND PROCEDURE

Given randomly generated latent codes, we generate facial images using an implementation of StyleGAN2 (Karras et al., 2020) with adaptive discriminator augmentation (ADA) and the pre-trained model weights for the FFHQ face dataset (Karras et al., 2017). We randomly generate initial latent codes and then filter out invalid faces, leaving 28 images for the task. As explained in Section 4, PAE can be integrated into any CLIP-based text-guided image manipulation algorithm to improve performance. We integrate PAE into StyleMC (Kocasari et al., 2022), StyleCLIP (Patashnik et al., 2021), and TediGAN (Xia et al., 2021), comparing their performance with and without PAE. To further demonstrate the disentangling power of our method and ablate other influencing factors, we additionally apply PAE to a naive image manipulation algorithm: we directly update the latent code of the randomly generated input images before the differentiable StyleGAN2 generator, according to either the naive loss (Figure 4a) or a loss with PAE (Figure 4b). Note that this approach does not include any additional supervision losses (such as an identity loss (Deng et al., 2019)) or other trainable parameters. As we will see in the following two subsections, PAE can achieve identity preservation even without the identity loss. In total, we compare four pairs of models; each pair consists of the original version (Figure 4a) vs. its +PAE version (Figure 4b): 1. Naive approach (Nv) vs. Naive+PAE (Nv+); 2. StyleMC (SM) (Kocasari et al., 2022) vs. StyleMC+PAE (SM+); 3. StyleCLIP (SC) (Patashnik et al., 2021) vs. StyleCLIP+PAE (SC+); and 4. TediGAN (TG) (Xia et al., 2021) vs. TediGAN+PAE (TG+). Note that each pair naturally forms an ablation study for PAE: the improvement brought by PAE can be observed by comparing each original model with its +PAE version.
To construct the PAEs, for emotion editing we use P_emotion^GS, defined via six basic human emotions (Ekman, 1992) - happy, sad, angry, fearful, surprised, and disgusted - as the semantic basis; for hairstyle editing we use P_hairstyle^PCA, defined via 68 hairstyle texts and ten principal components. The augmenting power is automatically selected according to the criteria stated in Appendix A.3.

5.2. QUALITATIVE COMPARISON

A qualitative comparison is shown in Figure 5a; the text prompts are written to the left of each row. Furthermore, in Figure 5b, we fix the text prompt to be "happy" and vary the augmenting power α, showing the controllability of our PAE. Please refer to Appendix A.6 for full results. From the figure, we can see that in the naive approach, changes are not made in a disentangled way. For example, the hair color and the lighting condition of the face in the first row changed, and the identity of the face in the second row changed. These problems are also present in the other three original models. We can also see that some edits lack quality or accuracy. Comparing the models before and after equipping them with PAE, we see that PAE enables much more disentangled, realistic, and accurate face manipulation.

5.3. QUANTITATIVE COMPARISON

In Table 1, we measure the performance of the aforementioned eight models with a quantitative comparison across seven metrics: Fréchet Inception Distance (FID) (Heusel et al., 2017), which measures the quality of the manipulated images; Learned Perceptual Image Patch Similarity (LPIPS), which measures the perceptual similarity of the manipulated images to the original ones; identity loss (IL), which uses ArcFace (Deng et al., 2019) to measure the degree of facial identity preservation; disentanglement measured with a facial-attribute classifier (Dis-C), which uses a facial-attribute classifier (Serengil & Ozpinar, 2021) to evaluate the degree of disentanglement by measuring changes in the model's classification of irrelevant attributes before and after editing; accuracy measured with a facial-attribute classifier (Acc-C), which uses the same classifier to measure the degree of conformity to the text prompt; and disentanglement (Dis-S) and accuracy (Acc-S) evaluated by survey, which measure the degree of disentanglement and of conformity to the text prompt, respectively, via a user survey. The first three scores are computed over the editing results of 5880 randomly generated images. The two classifier scores are computed over 48 emotion editing images: Dis-C is the percentage of model classifications of irrelevant attributes (age and race) that remain unchanged after the manipulation, and Acc-C is the percentage of output images whose model-predicted emotion matches the text prompt. Finally, the two survey scores are obtained from a user study involving 50 participants of various backgrounds evaluating the edits of 36 randomly generated images. We used α = 7.0 for PAE in all methods.

6. DISCUSSION, LIMITATION, AND FUTURE WORK

An interesting observation from our hairstyle editing experiment is that the difficulty of manipulation mirrors real-world patterns. For example, all models find it very hard to make a woman's hairstyle bald, whereas removing a man's hair is very easy. This suggests that the performance of such image manipulation methods is limited by the real-world datasets that the upstream models (StyleGAN2 and CLIP) are trained on. Another limitation of our method is that the concrete projection and augmentation operations need manual selection for the best result. We automated a coarse selection via the criteria introduced in Appendix A.3, but a rigorously defined numeric metric for disentanglement, accuracy, etc. could further benefit the performance. Also, texts that do not associate with an obvious attribute (e.g., a celebrity's name) are harder to project onto the subspace.

7. CONCLUSION

In this paper, we analyzed the CLIP joint space and its subspaces, and proposed the projection-augmentation embedding (PAE) as a better optimization target for CLIP-based text-guided image manipulation. Through a series of text-guided semantic face editing experiments, we demonstrated quantitatively and qualitatively that PAE facilitates more disentangled, interpretable, and controllable manipulation with state-of-the-art quality and accuracy.

A.1 ADDITIONAL EVALUATIONS OF INTER-MODALITY AND INTRA-MODALITY SIMILARITY

Note that this discrepancy between image embeddings and text embeddings in the CLIP space is general and not restricted to faces. We plot the embeddings of 100 random images and 100 random texts, visualized using PCA, in Figure 6, and record the similarities in Table 3. We make the same observation: image embeddings and text embeddings lie in different regions of C and have lower inter-modality similarity compared to the intra-modality similarity.

A.2 ADDITIONAL OPTIONS FOR PROJECTION AND AUGMENTATION

In this section we present two more options for projection and three more for augmentation. A note on notation: we give each option a short identifier (or none at all) and add it to the superscript of P_W, A_{W,T}, or E_W. For example, E_W^{GS,+} denotes the projection-augmentation embedding with a projection onto an orthonormal basis (P_W^GS) and an augmentation that preserves the coefficients (A_{W,T}^+).

A.2.1 PROJECTION

A simpler type of projection, P_W, also assumes W to be a linear subspace of the CLIP space C and is very similar to P_W^GS, except that it does not orthonormalize the basis vectors:

P_W(e_I) := Σ_k (b_k^T e_I) b_k. (10)

Note that in this case, taking dot products with the basis vectors is not a strict projection onto a linear subspace. However, this simple option also worked well in our preliminary experiments. This is possibly due to the high dimensionality of the CLIP space (512): if we do not have many basis vectors (e.g., < 20), their pairwise dot products tend to be very small, so they are already nearly orthogonal to each other. The second option, P_W^All, does not assume that W is a linear subspace; it can be any manifold. P_W^All directly stores the set T. By storing the embeddings of all related texts, it effectively samples and stores points on the manifold W, and the projection is approximated by picking the stored point closest to the vector to be projected:

P_W^All(e_I) := arg max_{e ∈ T} CosSim(e_I, e). (11)

Naturally, if we collect a larger T, we sample more points from the manifold and better approximate the projection.
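P_W^All reduces to a nearest-neighbour lookup over the stored corpus; a sketch with random placeholder embeddings:

```python
import numpy as np

def cos_sim(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

def project_all(e_img, corpus):
    """P_W^All: return the stored corpus embedding most similar to e_img.

    corpus: (M, d) array of text embeddings sampling the manifold W.
    """
    sims = np.array([cos_sim(e_img, e) for e in corpus])
    return corpus[int(np.argmax(sims))]

rng = np.random.default_rng(5)
corpus = rng.normal(size=(10, 32))
e = rng.normal(size=32)
corpus[3] = 2.0 * e          # plant a direction parallel to the query
```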

A.2.2 AUGMENTATION

We propose three more options for the augmentation operation: A_{W,T}, A^Ex_{W,T}, and A^ExD_{W,T}. The first is used with the projections P_W^GS, P_W, and P_W^PCA, and the last two with P_W^All. The simplest, A_{W,T}, adds the text and image together, with a coefficient α controlling the augmenting power:

A_{W,T}(w) := w + α e_T. (12)

However, simply adding e_T may result in a vector too similar to the original embedding, likely resulting in an optimization process that barely does anything. In that case, the augmentation introduced in Eq. (9) is a better option because it additionally weakens the other components by a certain amount (while preserving the sum of coefficients). To avoid confusion, we denote this simplest augmentation by A_{W,T} and the augmentation in Eq. (9) by A^+_{W,T}. The last two types of augmentation, A^Ex_{W,T} and A^ExD_{W,T}, are for P_W^All. Since P_W^All yields an embedding e ∈ T, which is expected to indicate the current feature of the image, if we want to change that feature to the one specified by the text T, we can simply exchange e for the text embedding e_T:

A^Ex_{W,T}(w) := w + α (e_T - P_W^All(e_I)), (13)

where α is the augmenting power. The final option, A^ExD_{W,T}, is a more robust version of A^Ex_{W,T}. Instead of the one-for-one exchange above, we perform a one-for-α exchange. More precisely, we still add α copies of e_T, but instead of subtracting α copies of the single most similar text embedding, we subtract the α distinct most similar text embeddings (each embedding subtracted only once). Naturally, between the one-for-one exchange in A^Ex_{W,T} and the one-for-α exchange in A^ExD_{W,T}, there exist many other options, one-for-k for any k between 1 and α. We leave this exploration to future work.
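The two exchange augmentations differ only in what is subtracted; a sketch with random placeholder embeddings, restricting alpha to positive integers for A^ExD:

```python
import numpy as np

def cos_sim(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

def nearest(e_img, corpus):
    # w = P_W^All(e_I): the most similar stored text embedding.
    sims = np.array([cos_sim(e_img, e) for e in corpus])
    return corpus[int(np.argmax(sims))]

def augment_exchange(e_img, e_txt, corpus, alpha):
    """A^Ex: exchange the image's current feature for the target text."""
    w = nearest(e_img, corpus)
    return w + alpha * (e_txt - w)

def augment_exchange_diverse(e_img, e_txt, corpus, alpha):
    """A^ExD: add alpha copies of e_txt but subtract the alpha distinct
    most similar corpus embeddings, each only once."""
    w = nearest(e_img, corpus)
    sims = np.array([cos_sim(e_img, e) for e in corpus])
    top = np.argsort(sims)[::-1][:alpha]
    return w + alpha * e_txt - corpus[top].sum(axis=0)
```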

A.3 EVALUATION CRITERIA

In the future, other researchers may want to develop new types of PAE for their specific tasks. In this section we provide two straightforward criteria to evaluate and select among different options of PAE:

1. CosSim(E_W, e_It) should be as large as possible, where e_It is the embedding of the target image, i.e., the ideal target. This is because our goal is to approximate the inaccessible e_It. At a minimum, we need CosSim(E_W, e_It) > CosSim(e_T, e_It), so that we gain something by replacing e_T with E_W as the optimization target.

2. We need CosSim(E_W, e_It) > CosSim(E_W, e_I); otherwise the PAE is more similar to the original image and the optimization will make no progress towards e_It.

We evaluate all eight options of PAE (two proposed in Section 4.1 and six in Appendix A.2) in the text-guided facial emotion editing experiment described in Section 5. P^GS_W is realized using the six basic human emotions (Ekman, 1992) as a semantic basis: "happy", "sad", "angry", "fearful", "surprised", "disgusted"; P^PCA_W and P^All_W are implemented with a corpus T consisting of 277 emotion texts found online and ten principal components. We plot the above quantities against α in Figure 7, where each color corresponds to an option of E_W. Restating the above criteria, within each color we need 1. the solid line to be as large as possible and at least above the orange horizontal line; and 2. the dash-dotted line to be above the black horizontal line. We see that in this particular experiment of facial emotion editing, it is most suitable to use E^+_W with α ∈ [5, 15], E^GS,+_W with α ∈ [10, 15], or E^PCA,+_W with α ∈ [2.5, 5]. Note that since the + versions perturb the embedding more, the dash-dotted lines for E^+_W, E^GS,+_W, and E^PCA,+_W are higher than those of their non-+ counterparts.
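The two criteria reduce to a pair of cosine-similarity inequalities. A small self-contained checker (our naming; it assumes access to a ground-truth target embedding e_It, which is available for evaluation but not during actual manipulation) might look like:

```python
import numpy as np

def cos_sim(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def check_pae(e_w, e_text, e_img, e_target):
    # Criterion 1: E_W approximates the ideal target better than the raw
    # text embedding does.
    c1 = cos_sim(e_w, e_target) > cos_sim(e_text, e_target)
    # Criterion 2: E_W is closer to the target than to the original image,
    # so the optimization actually moves the image towards the target.
    c2 = cos_sim(e_w, e_target) > cos_sim(e_w, e_img)
    return c1 and c2
```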

A.4 A FAILED CASE: DOUBLE-PROJECTED EMBEDDING

This section presents the double-projected embedding (DPE), which shares with PAE the idea of using CLIP subspaces to extract relevant information from the embeddings. DPE is simpler than PAE in that it omits the second step of augmenting the projected image embedding; instead, it projects both the image and text embeddings onto the subspace and directly optimizes the image towards the text within the subspace. We conduct the same text-guided emotion editing experiment as in Section 5 using DPE^GS and DPE^PCA with 20 principal components. The results are summarized in Figure 8 and Figure 9, respectively. Although most faces indeed changed emotion according to the text prompts, we lost disentanglement: in some manipulations, the lighting condition, the background, the hair, or even the facial identity changed. One possible explanation is that projecting faces onto the emotion subspace discards most information except the emotion, so irrelevant attributes cannot be preserved. Recall that in constructing PAE, the final step adds back the residual r to return to the "image region" (see Eq. (4)). We hypothesize that this r contains exactly the irrelevant information that we need to preserve. This example also shows the necessity and importance of using an optimization target that is close enough to the embedding e_I of the original image.
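The contrast between DPE and PAE is easiest to see in code. The sketch below (our notation, with an orthonormal basis matrix q for W and the simplest additive augmentation) shows that PAE carries the off-subspace residual r through to the target, while DPE discards it:

```python
import numpy as np

def pae_target(e_img, e_text, q, alpha):
    # q: d x k matrix with orthonormal columns spanning W.
    p = q @ (q.T @ e_img)      # step 1: project onto W
    w = p + alpha * e_text     # step 2: simplest augmentation A_{W,T}
    r = e_img - p              # residual: everything outside W
    return w + r               # step 3: add r back -> irrelevant info kept

def dpe_target(e_text, q):
    # DPE optimizes towards the text projected into W; the image residual
    # never enters the target, so off-subspace attributes are lost.
    return q @ (q.T @ e_text)
```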

A.5 CLIP SUBSPACE EXTRACTS RELEVANT INFORMATION FROM TEXTS

In this section we conduct two experiments to demonstrate that our emotion subspace W can extract the relevant information from texts. The emotion subspace W is created using either P^GS_W or P^PCA_W; both result in similar observations.

A.5.1 EXTRACT EMOTION FROM EMOTION-ANIMAL PHRASES

We compare the averaged similarity of emotion-animal phrases in the original CLIP space C and in the emotion subspace W created by P^GS_W. Each emotion-animal phrase consists of an adjective for an emotion (e.g., "happy", "sad") qualifying a noun for an animal (e.g., "horse", "man"). If W is able to capture the emotion information, then we expect the similarity of phrases with the same animal but different emotions (e.g., "happy horse", "sad horse") to be high in C (because they refer to the same animal) but low in W (because the emotions differ). Conversely, if the emotion is the same but the animals differ (e.g., "happy horse", "happy cat"), we expect the similarity to be high in W but low in C. The results are tabulated in Table 4 and coincide with our hypothesis: in W, emotions rather than animals dominate the similarity, showing that W can extract emotion information from the phrases. Table 4: Averaged similarity of emotion-animal phrases. In W, emotions dominate the similarity rather than the animals, showing that W can capture emotion information from the phrases.

                                   C      W
same animal, different emotions  0.863  0.439
same emotion, different animals  0.834  0.926
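The measurement behind Table 4 is straightforward to reproduce for any embedding function. The harness below is a hedged sketch under our own naming; it averages cosine similarity over all phrase pairs that share the animal or share the emotion.

```python
import itertools
import numpy as np

def cos_sim(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def similarity_rows(embed, emotions, animals):
    # Row 1 of Table 4: same animal, different emotions.
    same_animal = [cos_sim(embed(f"{e1} {a}"), embed(f"{e2} {a}"))
                   for a in animals
                   for e1, e2 in itertools.combinations(emotions, 2)]
    # Row 2: same emotion, different animals.
    same_emotion = [cos_sim(embed(f"{e} {a1}"), embed(f"{e} {a2}"))
                    for e in emotions
                    for a1, a2 in itertools.combinations(animals, 2)]
    return float(np.mean(same_animal)), float(np.mean(same_emotion))
```

Running it with CLIP's text encoder as `embed` would reproduce the C column; composing the encoder with a projection onto W would reproduce the W column.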

A.5.2 EXTRACT EMOTION FROM EMOTION TEXTS

We collect five emotions from each of the five groups happy, sad, angry, disgusted, and fearful (25 emotions in total) and use heat maps to visualize their pairwise similarity in C and in the emotion subspaces created by P^GS_W and P^PCA_W. The results are shown in Figure 10. We see that emotions in the same group have higher similarity in both emotion subspaces than in C.
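Each heat map is simply a pairwise cosine-similarity matrix of the 25 emotion embeddings, computed either in C directly or after projecting onto a subspace; a minimal sketch (our naming):

```python
import numpy as np

def pairwise_cosine(embeddings):
    # N x N cosine-similarity matrix: the data behind one heat map.
    x = np.stack(embeddings)
    x = x / np.linalg.norm(x, axis=1, keepdims=True)
    return x @ x.T
```

Feeding in the raw embeddings gives the heat map in C; feeding in their projections (e.g., q @ (q.T @ e) for an orthonormal basis q of W) gives the subspace heat maps.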

A.6 FULL RESULT OF TEXT-GUIDED FACE EDITING

In this section we present the full results of the text-guided face editing that could not be placed in Section 5 due to space limits. Figures 11 to 13 show the results of emotion, hairstyle, and physical characteristic editing, respectively, using different PAEs; the text prompts as well as the type of PAE used are written to the left of each row. Figures 14 to 26 compare the eight models in face editing for different text prompts in the aforementioned three editing categories. Figure 27 shows the controllability of our method by varying the augmenting power.

A.7 TEXT-GUIDED EDITING ON NON-FACIAL IMAGES

We also conduct text-guided editing on non-facial images, namely images of dogs from the AFHQ-Dog dataset (Choi et al., 2020). The experimental procedure is very similar to Section 5, except for three differences:

• StyleGAN2 is pre-trained on the AFHQ-Dog dataset instead of the FFHQ dataset;
• since StyleCLIP (Patashnik et al., 2021) and TediGAN (Xia et al., 2021) do not release a StyleGAN2 model pre-trained on AFHQ-Dog, we only compare Naive, Naive+PAE, StyleMC, and StyleMC+PAE;
• the fur subspace is constructed by P^PCA_fur, defined by extracting principal components from a corpus of nine fur-color texts: white fur, black fur, orange fur, brown fur, gray fur, golden fur, yellow fur, red fur, and blue fur.
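Constructing such a PCA-based projector from a small corpus reduces to an SVD. The sketch below is our own naming and assumes the corpus is centered before PCA; it builds a projector onto the span of the top-k principal directions.

```python
import numpy as np

def pca_projector(corpus_embeddings, k):
    x = np.stack(corpus_embeddings)
    x = x - x.mean(axis=0)                     # center the corpus
    _, _, vt = np.linalg.svd(x, full_matrices=False)
    q = vt[:k].T                               # d x k, orthonormal columns
    def project(e):
        # orthogonal projection onto the top-k principal subspace
        return q @ (q.T @ e)
    return project
```

For the fur subspace, `corpus_embeddings` would be the CLIP embeddings of the nine fur-color texts and k a small number of principal components.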



Figure 1: Using text embedding as the optimization target results in unsatisfactory outputs. Note the entangled changes (e.g., clothing of the left child, glasses of the lady), inaccurate output, and artifacts.

Figure 2: PCA visualization of face images and text descriptions (Section 3.1) and their corresponding PAEs in the CLIP space (Section 4). Note that the image and text embeddings do not overlap, that the PAEs are closer to the images, and that the PAEs spread less widely than the face images, indicating that they contain more of the specified information.

Figure 3: Similarity of video frames to the first frame and to the text "a happy face" in different spaces. The changes in the emotion subspace are the most significant, as it distills the relevant information.

Integration of PAE into the CLIP-based text-guided image manipulation paradigm, with a graphical demonstration of the calculation of PAE. It is calculated by 1. projecting the embedded image e_I (black) onto a pre-defined subspace of interest W (red); 2. augmenting the projected vector so that the effect of the text is amplified (blue); and finally 3. adding back the residual r to return to the "image region" in C (purple).

Comparison of eight models in text-guided face editing. We see that PAE promotes disentangled editing.

Figure 5: Experiment results

Figure 6: PCA visualization of 100 random images and 100 random texts in the CLIP space. Note that image and text embeddings lie in different regions.

(Legend of Figure 7: the orange baseline is sim(e_T, e_It); for each of the eight options — PAE, PAE^+, PAE^GS, PAE^GS,+, PAE^PCA, PAE^PCA,+, PAE^All,Ex, and PAE^All,ExD — the plot shows sim(PAE, e_It) (solid), sim(PAE, e_I), and the comparison of the two (dash-dotted).)

Figure 7: Comparison of eight options of PAE in a text-guided facial emotion editing experiment. Each option is color-coded. In order for E_W to be effective, we need the solid line to be as large as possible (at least above the orange horizontal line) and the dash-dotted line to be above the black horizontal line. In this experiment, the best three options are E^+_W (α ∈ [5, 15]), E^GS,+_W (α ∈ [10, 15]), and E^PCA,+_W (α ∈ [2.5, 5]).

Figure 8: Text-guided emotion editing using DPE GS . Note that changes are not disentangled.

Figure 9: Text-guided emotion editing using DPE PCA . Note that changes are not disentangled.

Figure 10: Pairwise similarity of 25 emotions in five groups in three different spaces. We see that emotions in the same group have higher similarity in both emotion subspaces than in C.

Figure 28 shows the aggregated results of the Naive+PAE approach. The text prompts are written to the left of each row. Figures 29 to 31 compare the four models for different text prompts.

Figure 11: Text-guided emotion editing using different PAEs

Seven metrics for eight image editing models. The ↓ beside a metric means that a lower score is better, and ↑ means the opposite.

Yun Zhang, Ruixin Liu, Yifan Pan, Dehao Wu, Yuesheng Zhu, and Zhiqiang Bai. GI-AEE: GAN inversion based attentive expression embedding network for facial expression editing. In 2021 IEEE International Conference on Image Processing (ICIP), pp. 2453-2457. IEEE, 2021.

In addition to the visualization in Figure 2, we record the averaged cosine similarity of the embeddings from different modalities in Table 2. We see that image embeddings and text embeddings lie in different regions of C and have lower inter-modality similarity than intra-modality similarity, regardless of their semantic meanings. From the table we can also see that PAEs have higher cosine similarity to images (0.491 and 0.493) than texts do (0.200).

Table 2: Averaged cosine similarity of CLIP embeddings of 100 random images and texts. Note that image and text embeddings have lower inter-modality similarity.


8. REPRODUCIBILITY STATEMENT

The results in the main part and the appendix of this paper can be reproduced using the source code submitted as supplementary material together with this paper. The project web page as well as the repository will be made publicly available after the blind review process. 

