CLIP-PAE: PROJECTION-AUGMENTATION EMBEDDING TO EXTRACT RELEVANT FEATURES FOR A DISENTANGLED, INTERPRETABLE, AND CONTROLLABLE TEXT-GUIDED IMAGE MANIPULATION

Abstract

Recently introduced Contrastive Language-Image Pre-Training (CLIP) (Radford et al., 2021) bridges images and text by embedding them into a joint latent space. This has opened the door to a growing body of work that manipulates an input image according to a textual description. However, due to the discrepancy between image and text embeddings in the joint space, using text embeddings as the optimization target often introduces undesired artifacts in the resulting images. Disentanglement, interpretability, and controllability are also hard to guarantee during manipulation. To alleviate these problems, we propose to define corpus subspaces spanned by relevant prompts to capture specific image characteristics, and we introduce the CLIP projection-augmentation embedding (PAE) as an optimization target to improve the performance of text-guided image manipulation. Our method is a simple and general paradigm: it is easy to compute and adapt, and can be smoothly incorporated into any CLIP-based image manipulation algorithm. To demonstrate its effectiveness, we conduct several theoretical and empirical studies. As a case study, we apply the method to text-guided semantic face editing. We quantitatively and qualitatively demonstrate that PAE facilitates more disentangled, interpretable, and controllable image manipulation with state-of-the-art quality and accuracy.

1. INTRODUCTION

In text-guided image manipulation, the system receives an image and a text prompt and is tasked with editing the image according to the prompt. Such tasks have received considerable research attention due to the great expressive power of natural language. The recently introduced and increasingly popular Contrastive Language-Image Pre-Training (CLIP) (Radford et al., 2021) achieves this bridging by embedding images and texts into a joint latent space. Combined with generative techniques such as Generative Adversarial Networks (GANs) (Goodfellow et al., 2014) and diffusion models (Ho et al., 2020; Dhariwal & Nichol, 2021), CLIP has been utilized to develop several high-quality image manipulation methods (e.g., Ling et al., 2021; Zhang et al., 2021; Antal & Bodó, 2021; Khodadadeh et al., 2022), where the image is optimized to be similar to the text prompt in the CLIP joint space. There are three important but difficult-to-satisfy properties when performing image manipulation: disentanglement, interpretability, and controllability. Disentanglement means that the manipulation should only change the features referred to by the text prompt and should not affect other, irrelevant attributes (Wu et al., 2021; Xu et al., 2022). Interpretability means that we know why and how an edit to the latent code affects the output image, and thus understand the reasoning behind each model decision (Doshi-Velez & Kim, 2018; Miller, 2019), or that the model can extract relevant information from the given data (Murdoch et al., 2019). Finally, controllability is the ability to control the intensity of the edit (You et al., 2021; Park et al., 2020; Li et al., 2019) for individual factors and is hence tightly related to disentanglement.
In CLIP-based text-guided image manipulation methods, since both the latent space of the generative network and the embedding space of CLIP extensively compress information (e.g., a 1024 × 1024 image contains 3 × 1024² dimensions, whereas a typical StyleGAN (Karras et al., 2019) and CLIP both have only 512-dimensional latent spaces), manipulation in the latent space is akin to a black box. Gabbay et al. (2021) argue that a GAN fails at disentanglement because it only focuses on localized features. Zindancıoglu & Sezgin (2021) also show that several action units in StyleGAN (Karras et al., 2019) are correlated even though they are responsible for semantically distinct attributes in the image. In addition, we found that although CLIP embeddings of images and texts share the same space, they actually reside far away from each other (see Section 3.1), which can lead to undesired artifacts in the generated images such as unintended editing or distortion of facial identity (see Figure 1). Finally, most existing methods for CLIP-based text-guided image manipulation do not allow free control of the magnitude of the edit. As a result, how to perform image manipulation in a disentangled, interpretable, and controllable way with the help of text models remains a hard and open problem. In this paper, we introduce a technique that can be applied to any CLIP-based text-guided image manipulation method to yield more disentangled, interpretable, and controllable performance. Rather than optimizing an image directly towards the text prompt in the CLIP joint latent space, we propose a novel type of CLIP embedding, the projection-augmentation embedding (PAE), as the optimization target. PAE is motivated by two empirical observations. First, image embeddings do not overlap with those of their corresponding texts in the joint space. This means that a text embedding does not represent the embedding of the true target image that should be optimized for.
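The claim that image and text embeddings occupy separate regions of the joint space can be checked with a simple centroid-distance measurement. The sketch below is our own illustration, not code from the paper: it assumes two arrays of precomputed, L2-normalized CLIP embeddings (obtained, e.g., from CLIP's image and text encoders) and reports how far apart the two modalities sit on average.

```python
import numpy as np

def modality_gap(img_embs: np.ndarray, txt_embs: np.ndarray) -> float:
    """Distance between the centroids of two sets of unit-norm embeddings.

    img_embs: (n, d) array, each row an L2-normalized image embedding.
    txt_embs: (m, d) array, each row an L2-normalized text embedding.
    A large value indicates the two modalities reside in separate
    regions of the joint space despite sharing its dimensionality.
    """
    img_center = img_embs.mean(axis=0)
    txt_center = txt_embs.mean(axis=0)
    return float(np.linalg.norm(img_center - txt_center))
```

In practice one would embed a batch of images and a batch of captions with the same CLIP model, normalize each row, and compare this cross-modal gap against the within-modality spread.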
Therefore, directly optimizing an image to be similar to the text in the CLIP space results in undesirable artifacts or changes to irrelevant attributes. Second, a CLIP subspace constructed from embeddings of a set of relevant texts can constrain the changes of an image to only relevant attributes. To construct a PAE, we first project the embedding of the input image onto a corpus subspace constructed from relevant texts describing the attributes we aim to disentangle, and record a residual vector. Next, we augment this projected embedding in the subspace with the target text prompt, allowing a user-specified augmenting power to control the intensity of change. Finally, we add back the residual to reconstruct a vector close to the "image region" of the joint space. We demonstrate that the PAE is a better approximation to the embedding of the true target image. With PAE, we achieve better interpretability via an explicit construction of the corpus subspace. We achieve better disentanglement since the subspace constrains the changes of the original image to only relevant attributes. We achieve free control of the magnitude of the edit via a user-specified augmenting power. PAE is easy to use, quick to pre-compute, and can be incorporated into any CLIP-based latent manipulation algorithm to improve performance. In short, we highlight the three major contributions of this paper:

1. (Section 3) We perform a series of empirical analyses of the CLIP space and its subspaces, identifying i) the limitations of using a naive CLIP loss for text-guided image editing; and ii) several unique properties of the CLIP subspace.

2. (Section 4) Based on our findings in Section 3, we propose the projection-augmentation embedding (PAE) as a better approximation to the embedding of the true target image.

3. (Section 5) We demonstrate that employing PAE as an alternative optimization target facilitates more disentangled, interpretable, and controllable text-guided image manipulation. This is validated through several text-guided semantic face editing experiments where we integrate PAE into a set of state-of-the-art models. We quantitatively and qualitatively demonstrate that PAE boosts the performance of all chosen models with high quality and accuracy.
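The project-augment-restore steps described above can be illustrated with a short NumPy sketch. This is our own rough interpretation, not the paper's exact formulation: the QR orthonormalization of the corpus, the additive augmentation rule, and all function names are assumptions made for illustration.

```python
import numpy as np

def corpus_basis(corpus_embs: np.ndarray) -> np.ndarray:
    """Orthonormal basis (d, k) of the subspace spanned by corpus prompts.

    corpus_embs: (n, d) array of CLIP embeddings of relevant text prompts.
    """
    # QR decomposition of the transposed matrix yields orthonormal
    # columns spanning the same subspace as the corpus embeddings.
    q, _ = np.linalg.qr(corpus_embs.T)
    return q[:, : np.linalg.matrix_rank(corpus_embs)]

def pae(img_emb: np.ndarray, basis: np.ndarray,
        target_emb: np.ndarray, alpha: float) -> np.ndarray:
    """Projection-augmentation embedding (illustrative sketch).

    img_emb:    (d,) CLIP embedding of the input image.
    basis:      (d, k) orthonormal basis of the corpus subspace.
    target_emb: (d,) CLIP embedding of the target text prompt.
    alpha:      user-specified augmenting power (edit intensity).
    """
    # 1) Project the image embedding onto the corpus subspace
    #    and record the residual (the component outside the subspace).
    proj = basis @ (basis.T @ img_emb)
    residual = img_emb - proj
    # 2) Augment the projection with the target prompt's in-subspace
    #    component, scaled by the augmenting power alpha.
    target_in_subspace = basis @ (basis.T @ target_emb)
    augmented = proj + alpha * target_in_subspace
    # 3) Add back the residual so the result stays close to the
    #    "image region" of the joint space.
    return augmented + residual
```

Under this sketch, alpha = 0 recovers the original image embedding exactly, and increasing alpha strengthens the edit along the target direction while the residual keeps attributes outside the corpus subspace untouched.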

2. RELATED WORK

Latent Manipulation for Image Editing One popular approach to image manipulation operates on latent codes: the input image is first embedded into the latent space of a pre-trained generative network such as a GAN (Goodfellow et al., 2014); to modify the image, one then updates either the latent code of the image (e.g., Zhu et al., 2016; Ling et al., 2021; Zhang et al., 2021; Antal & Bodó, 2021; Khodadadeh et al., 2022; Creswell & Bharath, 2018; Perarnau et al., 2016; Pidhorskyi et al., 2020; Hou et al., 2022b; Xia et al., 2021; Kocasari et al., 2022; Patashnik et al., 2021; Shen et al., 2020), or the weights of the network (e.g., Cherepkov et al., 2021; Nitzan et al., 2022; Reddy

