DIFFUSION MODELS ALREADY HAVE A SEMANTIC LATENT SPACE

Abstract

Diffusion models achieve outstanding generative performance in various domains. Despite their great success, they lack a semantic latent space, which is essential for controlling the generative process. To address the problem, we propose the asymmetric reverse process (Asyrp), which discovers the semantic latent space in frozen pretrained diffusion models. Our semantic latent space, named h-space, has nice properties that accommodate semantic image manipulation: homogeneity, linearity, robustness, and consistency across timesteps. In addition, we introduce a principled design of the generative process for versatile editing and quality boosting using quantifiable measures: editing strength of an interval and quality deficiency at a timestep. Our method is applicable to various architectures (DDPM++, iDDPM, and ADM) and datasets (CelebA-HQ, AFHQ-dog, LSUN-church, LSUN-bedroom, and METFACES). Project page:

1. INTRODUCTION

In image synthesis, diffusion models have advanced to achieve state-of-the-art performance regarding quality and mode coverage since the introduction of denoising diffusion probabilistic models (Ho et al., 2020). They disrupt images by adding noise through multiple steps of the forward process and generate samples by progressive denoising through multiple steps of the reverse (i.e., generative) process. Since their deterministic version provides nearly perfect reconstruction of original images (Song et al., 2020a), they are suitable for image editing, which renders target attributes on real images. However, simply editing the latent variables (i.e., intermediate noisy images) causes degraded results (Kim & Ye, 2021). Instead, they require complicated procedures: providing guidance in the reverse process or finetuning models for an attribute. Figure 1(a-c) briefly illustrates the existing approaches. Image guidance mixes the latent variables of the guiding image with unconditional latent variables (Choi et al., 2021; Lugmayr et al., 2022; Meng et al., 2021). Though it provides some control, it is ambiguous to specify which attributes to reflect among those in the guide and the unconditional result, and it lacks intuitive control over the magnitude of change. Classifier guidance manipulates images by imposing gradients of a classifier on the latent variables in the reverse process to match the target class (Dhariwal & Nichol, 2021; Avrahami et al., 2022; Liu et al., 2021). It requires training an extra classifier for the latent variables, i.e., noisy images. Furthermore, computing gradients through the classifier during sampling is costly. Finetuning the whole model can steer the resulting images to the target attribute without the above problems (Kim & Ye, 2021). Still, it requires multiple models to reflect multiple descriptions.
On the other hand, generative adversarial networks (Goodfellow et al., 2020) inherently provide straightforward image editing in their latent space. Given a latent vector for an original image, we can find the direction in the latent space that maximizes the similarity of the resulting image with a target description in the CLIP embedding (Patashnik et al., 2021). The latent direction found on one image leads to the same manipulation of other images. However, given a real image, finding its exact latent vector is often challenging and produces unexpected appearance changes. If diffusion models, with their nearly perfect inversion property, had such a semantic latent space, they would allow admirable image editing. Preechakul et al. (2022) introduces an additional input to the reverse diffusion process: a latent vector from an original image embedded by an extra encoder. This latent vector contains the semantics to condition the process. However, it requires training from scratch and does not work with pretrained diffusion models. In this paper, we propose the asymmetric reverse process (Asyrp), which discovers the semantic latent space of a frozen diffusion model such that modifications in the space edit attributes of the original images. Our semantic latent space, named h-space, has the properties necessary for editing applications as follows. The same shift in this space results in the same attribute change in all images. Linear changes in this space lead to linear changes in attributes. The changes do not degrade the quality of the resulting images. The changes throughout the timesteps are almost identical to each other for a desired attribute change. Figure 1(d) illustrates some of these properties, and § 5.3 provides detailed analyses. To the best of our knowledge, this is the first attempt to discover the semantic latent space of frozen pretrained diffusion models.
Spoiler alert: our semantic latent space is different from the intermediate latent variables in the diffusion process. Moreover, we introduce a principled design of the generative process for versatile editing and quality boosting by quantifiable measures: editing strength of an interval and quality deficiency at a timestep. Extensive experiments demonstrate that our method is generally applicable to various architectures (DDPM++, iDDPM, and ADM) and datasets (CelebA-HQ, AFHQ-dog, LSUN-church, LSUN-bedroom, and METFACES).

2. BACKGROUND

We briefly describe essential backgrounds. The rest of the related work is deferred to Appendix A.

2.1. DENOISING DIFFUSION PROBABILISTIC MODEL (DDPM)

DDPM is a latent variable model that learns a data distribution by denoising noisy images (Ho et al., 2020). The forward process diffuses the data samples through Gaussian transitions parameterized with a Markov process: $q(x_t \mid x_{t-1}) = \mathcal{N}(x_t; \sqrt{1-\beta_t}\, x_{t-1}, \beta_t I) = \mathcal{N}(x_t; \sqrt{\alpha_t/\alpha_{t-1}}\, x_{t-1}, (1 - \alpha_t/\alpha_{t-1}) I)$, where $\{\beta_t\}_{t=1}^{T}$ is the variance schedule and $\alpha_t = \prod_{s=1}^{t}(1-\beta_s)$. Then the reverse process becomes $p_\theta(x_{0:T}) := p(x_T) \prod_{t=1}^{T} p_\theta(x_{t-1} \mid x_t)$, starting from $x_T \sim \mathcal{N}(0, I)$, with the noise predictor $\epsilon^\theta_t$: $x_{t-1} = \frac{1}{\sqrt{1-\beta_t}}\left(x_t - \frac{\beta_t}{\sqrt{1-\alpha_t}}\, \epsilon^\theta_t(x_t)\right) + \sigma_t z_t$, where $z_t \sim \mathcal{N}(0, I)$ and $\sigma_t^2$ is the variance of the reverse process, which DDPM sets to $\sigma_t^2 = \beta_t$.
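As a concrete reference, the closed-form forward sample and one reverse step can be sketched in NumPy. The linear β schedule and the perfect noise predictor below are illustrative stand-ins, not the paper's settings:

```python
import numpy as np

# Toy linear variance schedule; the values are illustrative, not the paper's.
T = 1000
betas = np.linspace(1e-4, 0.02, T)     # beta_1 ... beta_T
alphas = np.cumprod(1.0 - betas)       # alpha_t = prod_{s<=t} (1 - beta_s)

def q_sample(x0, t, eps):
    """Closed-form forward sample: x_t = sqrt(alpha_t) x0 + sqrt(1-alpha_t) eps."""
    return np.sqrt(alphas[t]) * x0 + np.sqrt(1.0 - alphas[t]) * eps

def ddpm_reverse_step(xt, t, eps_pred, z):
    """One DDPM reverse step x_t -> x_{t-1} with sigma_t^2 = beta_t."""
    mean = (xt - betas[t] / np.sqrt(1.0 - alphas[t]) * eps_pred) / np.sqrt(1.0 - betas[t])
    return mean + np.sqrt(betas[t]) * z

rng = np.random.default_rng(0)
x0, eps = rng.standard_normal((4, 4)), rng.standard_normal((4, 4))
xT = q_sample(x0, T - 1, eps)
# With the true noise and z = 0, the step moves x_T back toward the trajectory.
x_prev = ddpm_reverse_step(xT, T - 1, eps, np.zeros((4, 4)))
```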

2.2. DENOISING DIFFUSION IMPLICIT MODEL (DDIM)

DDIM redefines Eq. (1) as $q_\sigma(x_{t-1} \mid x_t, x_0) = \mathcal{N}\!\left(\sqrt{\alpha_{t-1}}\, x_0 + \sqrt{1-\alpha_{t-1}-\sigma_t^2}\cdot \frac{x_t - \sqrt{\alpha_t}\, x_0}{\sqrt{1-\alpha_t}},\ \sigma_t^2 I\right)$, which is a non-Markovian process (Song et al., 2020a). Accordingly, the reverse process becomes $x_{t-1} = \sqrt{\alpha_{t-1}} \underbrace{\left(\frac{x_t - \sqrt{1-\alpha_t}\, \epsilon^\theta_t(x_t)}{\sqrt{\alpha_t}}\right)}_{\text{"predicted } x_0\text{"}} + \underbrace{\sqrt{1-\alpha_{t-1}-\sigma_t^2}\cdot \epsilon^\theta_t(x_t)}_{\text{"direction pointing to } x_t\text{"}} + \underbrace{\sigma_t z_t}_{\text{random noise}}$, where $\sigma_t = \eta \sqrt{(1-\alpha_{t-1})/(1-\alpha_t)}\, \sqrt{1-\alpha_t/\alpha_{t-1}}$. When $\eta = 1$ for all $t$, it becomes DDPM. With $\eta = 0$, the process becomes deterministic and guarantees nearly perfect inversion.
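The DDIM step decomposes exactly into the three terms above. A minimal NumPy sketch (with an assumed linear β schedule) makes the η = 0 case easy to verify: given the true noise, one step lands exactly on the forward trajectory at t−1:

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)   # illustrative schedule, not the paper's
alphas = np.cumprod(1.0 - betas)

def ddim_step(xt, t, eps_pred, eta=0.0, z=None):
    """One DDIM reverse step x_t -> x_{t-1}; eta = 0 is the deterministic case."""
    a_t, a_prev = alphas[t], alphas[t - 1]
    sigma = eta * np.sqrt((1 - a_prev) / (1 - a_t)) * np.sqrt(1 - a_t / a_prev)
    pred_x0 = (xt - np.sqrt(1 - a_t) * eps_pred) / np.sqrt(a_t)   # "predicted x0"
    direction = np.sqrt(1 - a_prev - sigma**2) * eps_pred          # "direction to x_t"
    noise = sigma * (z if z is not None else 0.0)                  # random noise
    return np.sqrt(a_prev) * pred_x0 + direction + noise

rng = np.random.default_rng(0)
x0, eps = rng.standard_normal(8), rng.standard_normal(8)
t = 500
xt = np.sqrt(alphas[t]) * x0 + np.sqrt(1 - alphas[t]) * eps   # forward sample
x_prev = ddim_step(xt, t, eps)   # perfect noise prediction, eta = 0
```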

2.3. IMAGE MANIPULATION WITH CLIP

CLIP learns multimodal embeddings with an image encoder $E_I$ and a text encoder $E_T$ whose similarity indicates semantic similarity between images and texts (Radford et al., 2021). Compared to directly minimizing the cosine distance between the edited image and the target description (Patashnik et al., 2021), directional loss with cosine distance achieves homogeneous editing without mode collapse (Gal et al., 2021): $\mathcal{L}_{direction}(x^{edit}, y^{target}; x^{source}, y^{source}) := 1 - \frac{\Delta I \cdot \Delta T}{\|\Delta I\| \|\Delta T\|}$, where $\Delta T = E_T(y^{target}) - E_T(y^{source})$ and $\Delta I = E_I(x^{edit}) - E_I(x^{source})$ for edited image $x^{edit}$, target description $y^{target}$, original image $x^{source}$, and source description $y^{source}$. We use the prompts 'smiling face' and 'face' as the target and source descriptions for the facial attribute smiling.
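The directional loss needs only the four embeddings. A small sketch, where the embeddings are stand-in vectors rather than real CLIP outputs:

```python
import numpy as np

def directional_loss(img_edit_emb, img_src_emb, txt_tgt_emb, txt_src_emb):
    """CLIP directional loss: 1 - cos(dI, dT) between image and text edit directions."""
    dI = img_edit_emb - img_src_emb    # image direction
    dT = txt_tgt_emb - txt_src_emb     # text direction
    cos = np.dot(dI, dT) / (np.linalg.norm(dI) * np.linalg.norm(dT))
    return 1.0 - cos

# Stand-in embeddings: when the image moves exactly along the text direction,
# the loss is 0; when it moves against it, the loss is 2.
a = np.array([1.0, 0.0, 0.0])
b = np.array([0.0, 0.0, 3.0])
d = np.array([0.0, 2.0, 0.0])
aligned = directional_loss(a + d, a, b + d, b)   # parallel directions
opposed = directional_loss(a - d, a, b + d, b)   # anti-parallel directions
```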

3. DISCOVERING SEMANTIC LATENT SPACE IN DIFFUSION MODELS

This section explains why naive approaches do not work and proposes a new controllable reverse process. Then we describe the techniques for controlling the generative process. Throughout this paper, we use an abbreviated version of Eq. (3): $x_{t-1} = \sqrt{\alpha_{t-1}}\, P_t(\epsilon^\theta_t(x_t)) + D_t(\epsilon^\theta_t(x_t)) + \sigma_t z_t$, where $P_t(\epsilon^\theta_t(x_t))$ denotes the predicted $x_0$ and $D_t(\epsilon^\theta_t(x_t))$ denotes the direction pointing to $x_t$. We omit $\sigma_t z_t$ for brevity, except when $\eta \neq 0$. We further abbreviate $P_t(\epsilon^\theta_t(x_t))$ as $P_t$ and $D_t(\epsilon^\theta_t(x_t))$ as $D_t$ when the context clearly specifies the arguments.

3.1. PROBLEM

We aim to allow semantic latent manipulation of images $x_0$ generated from $x_T$ given a pretrained and frozen diffusion model. The easiest idea to manipulate $x_0$ is simply updating $x_T$ to optimize the directional CLIP loss of Eq. (4) given text prompts. However, it leads to distorted images or incorrect manipulation (Kim & Ye, 2021). An alternative approach is to shift the noise $\epsilon^\theta_t$ predicted by the network at each sampling step. However, it does not manipulate $x_0$ because the intermediate changes in $P_t$ and $D_t$ cancel each other out, resulting in the same $p_\theta(x_{0:T})$, similarly to destructive interference.

Theorem 1. Let $\epsilon^\theta_t$ be a predicted noise during the original reverse process at $t$ and $\tilde{\epsilon}^\theta_t$ be its shifted counterpart. Then, $\Delta x_t = \tilde{x}_{t-1} - x_{t-1}$ is negligible, where $\tilde{x}_{t-1} = \sqrt{\alpha_{t-1}}\, P_t(\tilde{\epsilon}^\theta_t(x_t)) + D_t(\tilde{\epsilon}^\theta_t(x_t))$. I.e., the shifted terms of $\tilde{\epsilon}^\theta_t$ in $P_t$ and $D_t$ destruct each other in the reverse process.

Appendix C proves the above theorem. Figure 13(a-b) shows that $\tilde{x}_0$ is almost identical to $x_0$.

3.2. ASYMMETRIC REVERSE PROCESS

In order to break the interference, we propose a new controllable reverse process with asymmetry: i.e., we modify only $P_t$ by shifting $\epsilon^\theta_t$ to $\tilde{\epsilon}^\theta_t$ while preserving $D_t$:

$x_{t-1} = \sqrt{\alpha_{t-1}}\, P_t(\tilde{\epsilon}^\theta_t(x_t)) + D_t(\epsilon^\theta_t(x_t)).$

Intuitively, it modifies the original reverse process according to $\Delta\epsilon_t = \tilde{\epsilon}^\theta_t - \epsilon^\theta_t$ while not altering the direction toward $x_t$, so that $x_{t-1}$ follows the original flow $D_t$ at each sampling step. Figure 2 illustrates this intuition. As in Avrahami et al. (2022), we use the modified $P^{edit}_t$ and the original $P^{source}_t$ as visual inputs for the directional CLIP loss in Eq. (4), and regularize the difference between the modified $P^{edit}_t$ and the original $P^{source}_t$. We find $\Delta\epsilon = \arg\min_{\Delta\epsilon} \mathbb{E}_t\, \mathcal{L}^{(t)}$ where

$\mathcal{L}^{(t)} = \lambda_{CLIP}\, \mathcal{L}_{direction}\!\left(P^{edit}_t, y^{ref}; P^{source}_t, y^{source}\right) + \lambda_{recon} \left| P^{edit}_t - P^{source}_t \right|$ (7)

Although $\Delta\epsilon$ indeed renders the attribute in $x^{edit}_0$, ϵ-space lacks the necessary properties of a semantic latent space in diffusion models, which will be described in the following.
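The contrast between the symmetric shift (Theorem 1) and the asymmetric shift can be seen numerically. In this NumPy sketch (illustrative β schedule, random vectors in place of images and noise), shifting both P_t and D_t barely moves x_{t−1}, while shifting only P_t moves it on the order of the shift itself:

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)   # illustrative schedule
alphas = np.cumprod(1.0 - betas)

def pred_x0(xt, t, eps):
    """P_t: DDIM's predicted x0 from the noise estimate."""
    return (xt - np.sqrt(1 - alphas[t]) * eps) / np.sqrt(alphas[t])

def symmetric_step(xt, t, eps):
    """Ordinary deterministic DDIM step (same eps in P_t and D_t)."""
    return np.sqrt(alphas[t - 1]) * pred_x0(xt, t, eps) + np.sqrt(1 - alphas[t - 1]) * eps

def asyrp_step(xt, t, eps_orig, eps_shift):
    """Asymmetric step: the shifted eps enters P_t only; D_t keeps the original."""
    return (np.sqrt(alphas[t - 1]) * pred_x0(xt, t, eps_shift)
            + np.sqrt(1 - alphas[t - 1]) * eps_orig)

rng = np.random.default_rng(0)
xt, eps, d_eps = (rng.standard_normal(8) for _ in range(3))
t = T - 1
# Shifting eps in BOTH terms nearly cancels (destructive interference) ...
sym_shift = symmetric_step(xt, t, eps + d_eps) - symmetric_step(xt, t, eps)
# ... while shifting P_t only actually moves x_{t-1}.
asym_shift = asyrp_step(xt, t, eps, eps + d_eps) - symmetric_step(xt, t, eps)
```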

3.3. h-space

Note that $\epsilon^\theta_t$ is implemented as a U-Net in all state-of-the-art diffusion models. We choose its bottleneck, the deepest feature maps $h_t$, to control $\epsilon^\theta_t$. By design, $h_t$ has smaller spatial resolution and higher-level semantics than $\epsilon^\theta_t$. Accordingly, the sampling equation becomes $x_{t-1} = \sqrt{\alpha_{t-1}}\, P_t(\epsilon^\theta_t(x_t \mid \Delta h_t)) + D_t(\epsilon^\theta_t(x_t)) + \sigma_t z_t$, where $\epsilon^\theta_t(x_t \mid \Delta h_t)$ adds $\Delta h_t$ to the original feature maps $h_t$. The $\Delta h_t$ minimizing the same loss in Eq. (7) with $P_t(\epsilon^\theta_t(x_t \mid \Delta h_t))$ instead of $P_t(\tilde{\epsilon}^\theta_t(x_t))$ successfully manipulates the attributes. We observe that h-space in Asyrp has the following properties that other spaces do not have:

• The same ∆h leads to the same effect on different samples.
• Linearly scaling ∆h controls the magnitude of attribute change, even with negative scales.
• Adding multiple ∆h's manipulates the corresponding multiple attributes simultaneously.
• ∆h preserves the quality of the resulting images without degradation.
• ∆h_t is roughly consistent across different timesteps t.

The above properties are demonstrated thoroughly in § 5.3. Appendix D.3 provides details of h-space and suboptimal results from alternative choices.
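A structural sketch of the injection point: the model stays frozen, and editing only adds ∆h to the deepest feature map before decoding. The toy encoder/decoder below is a hypothetical stand-in for the U-Net, kept only to show the data flow:

```python
import numpy as np

rng = np.random.default_rng(0)
D, H = 16, 4                          # toy "pixel" and bottleneck sizes (hypothetical)
W_enc = rng.standard_normal((H, D)) * 0.1
W_dec = rng.standard_normal((D, H)) * 0.1

def toy_eps(x, delta_h=None):
    """Stand-in for the frozen eps-predictor exposing its bottleneck h_t.

    Only the data flow matters here: encoder -> bottleneck -> decoder with a
    skip path. Asyrp's edit adds delta_h to the deepest feature map; no model
    weight is ever changed."""
    h = np.tanh(W_enc @ x)            # bottleneck feature h_t
    if delta_h is not None:
        h = h + delta_h               # eps_theta(x_t | delta_h)
    return W_dec @ h + 0.01 * x       # decoder plus a skip connection

delta_h = 0.5 * np.ones(H)                 # hypothetical learned direction
eps_base = toy_eps(np.ones(D))             # original prediction (enters D_t)
eps_shift = toy_eps(np.ones(D), delta_h)   # shifted prediction (enters P_t)
```

In the asymmetric step, both calls are made at each timestep: the shifted output forms P_t and the unmodified output forms D_t.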

3.4. IMPLICIT NEURAL DIRECTIONS

Although ∆h succeeds in manipulating images, directly optimizing ∆h_t over multiple timesteps requires many iterations of training with a carefully chosen learning rate and schedule. Instead, we define an implicit function $f_t(h_t)$ which produces $\Delta h_t$ for given $h_t$ and $t$. $f_t$ is implemented as a small neural network with two 1×1 convolutions, concatenating the timestep $t$. See Appendix E for the details. Accordingly, we optimize the same loss in Eq. (7) with $P^{edit}_t = P_t(\epsilon^\theta_t(x_t \mid f_t))$. Learning $f_t$ is more robust to learning rate settings and converges faster than learning every $\Delta h_t$. In addition, as $f_t$ learns an implicit function for given timesteps and bottleneck features, it generalizes to unseen timesteps and bottleneck features. This generalization allows us to borrow the accelerated training scheme of DDIM defined on a subsequence $\{x_{\tau_i}\}_{i \in [1,S]}$, where $\{\tau_i\}$ is a subsequence of $[1, \ldots, T]$ and $S < T$. Then, we can use the generative process with a custom subsequence $\{\tilde{\tau}_i\}$ of length $\tilde{S} < T$ through normalization: $\Delta \tilde{h}_{\tilde{\tau}} = f_{\tilde{\tau}}(h_{\tilde{\tau}})\, S/\tilde{S}$. It preserves the total amount of change: the sum of $\Delta \tilde{h}_{\tilde{\tau}}$ over $\tilde{S}$ steps matches the sum of $\Delta h_t$ over $S$ steps. Therefore, we can use $f_t$ trained on any subsequence for any length of the generative process. See Appendix F for details. We use $f_t$ to compute $\Delta h_t$ for all experiments except Figure 6.

Figure 3: Intuition for choosing the intervals for editing and quality boosting. We choose the intervals by quantifying two measures (top left inset). The editing strength of an interval [T, t] measures its perceptual difference from T until t. We set [T, t_edit] to the interval with the smallest editing strength that synthesizes P_t close to x, i.e., LPIPS(x, P_t) = 0.33. Editing flexibility of an interval [t, 0] measures the potential amount of change after t. Quality deficiency at t measures the amount of noise in x_t. We set [t_boost, 0] to handle large quality deficiency (i.e., LPIPS(x, x_t) = 1.2) with small editing flexibility.
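A sketch of what such an f_t could look like: two 1×1 convolutions (per-pixel linear maps) over the bottleneck features concatenated with a sinusoidal timestep embedding. All sizes and the embedding details are assumptions for illustration, not the paper's exact architecture:

```python
import numpy as np

rng = np.random.default_rng(0)
C, S = 512, 8          # bottleneck channels and spatial size (typical values, assumed)
C_t = 64               # timestep-embedding channels (hypothetical)
W1 = rng.standard_normal((C, C + C_t)) * 0.01   # first 1x1 conv weights
W2 = rng.standard_normal((C, C)) * 0.01         # second 1x1 conv weights

def timestep_embedding(t, dim=C_t):
    """Sinusoidal encoding of t, broadcast over the spatial grid."""
    freqs = np.exp(-np.arange(dim // 2) * np.log(10000.0) / (dim // 2))
    emb = np.concatenate([np.sin(t * freqs), np.cos(t * freqs)])
    return np.broadcast_to(emb[:, None, None], (dim, S, S))

def f_t(h, t):
    """Implicit direction: 1x1 conv -> ReLU -> 1x1 conv over [h_t ; emb(t)]."""
    inp = np.concatenate([h, timestep_embedding(t)], axis=0)     # (C + C_t, S, S)
    hidden = np.maximum(np.einsum('oc,chw->ohw', W1, inp), 0.0)  # 1x1 conv + ReLU
    return np.einsum('oc,chw->ohw', W2, hidden)                  # delta_h_t

h = np.zeros((C, S, S))                  # stand-in bottleneck features
dh10, dh500 = f_t(h, 10), f_t(h, 500)    # directions at two timesteps
```

Because t enters as an input, one such network covers every timestep, which is what enables the S/S̃ rescaling to other sampling-step counts.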

4. GENERATIVE PROCESS DESIGN

This section describes the entire editing process, which consists of three phases: editing with Asyrp, traditional denoising, and quality boosting. We design formulas to determine the length of each phase with quantifiable measures.

4.1. EDITING PROCESS WITH ASYRP

Diffusion models generate the high-level context in the early stage and imperceptible fine details in the later stage (Choi et al., 2022). Likewise, we modify the generative process in the early stage to achieve semantic changes. We refer to this early stage as the editing interval [T, t_edit]. LPIPS(x, P_T) and LPIPS(x, P_t) calculate the perceptual distance between the original image and the predicted image at timesteps T and t, respectively. Intuitively, the high-level content is already determined by the predicted terms at the respective timesteps, and LPIPS measures the remaining component to be edited through the remaining reverse process. Consequently, we define the editing strength of an interval [T, t] as $\xi_t = \mathrm{LPIPS}(x, P_T) - \mathrm{LPIPS}(x, P_t)$, indicating the perceptual change from timestep T to t in the original generative process. Figure 3 illustrates LPIPS(x, ·) for P_t and x_t with examples, and its inset depicts editing strength. A shorter editing interval has lower ξ_t, and a longer editing interval brings more changes to the resulting images. We seek the shortest editing interval that brings enough distinguishable changes to the images in general. We empirically find that t_edit with LPIPS(x, P_{t_edit}) = 0.33 builds the shortest editing interval with enough editing strength, as P_{t_edit} has nearly all visual attributes of x. However, some attributes require more visual changes than others, e.g., pixar > smile. For such attributes, we increase the editing strength ξ_t by $\delta = 0.33\, d(E_T(y^{source}), E_T(y^{target}))$, where $E_T(\cdot)$ produces the CLIP text embedding, $y^{(\cdot)}$ denotes the descriptions, and $d(\cdot, \cdot)$ computes the cosine distance between the arguments. Choosing t_edit with LPIPS(x, P_{t_edit}) = 0.33 − δ expands the editing interval to a suitable length. It consistently produces good results in various settings. The supporting experiments are shown in Appendix G.
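Given the LPIPS curve of the original trajectory, selecting t_edit reduces to a threshold crossing. A sketch with a hypothetical monotone curve in place of real LPIPS measurements:

```python
import numpy as np

def find_t_edit(lpips_to_x, threshold=0.33):
    """Return the largest t with LPIPS(x, P_t) <= threshold.

    lpips_to_x[t] stands in for LPIPS(x, P_t), assumed to decrease as t -> 0
    (P_t converges to the original image x). The result is the endpoint of the
    shortest editing interval [T, t_edit] whose P_{t_edit} still carries nearly
    all visual attributes of x."""
    for t in range(len(lpips_to_x) - 1, -1, -1):
        if lpips_to_x[t] <= threshold:
            return t
    return 0

# Hypothetical monotone curve standing in for measured values, t = 0 .. 999.
curve = np.linspace(0.0, 0.8, 1000)
t_edit = find_t_edit(curve)
```

For attributes needing larger changes, lowering the threshold to 0.33 − δ simply moves the crossing point to a smaller t, i.e., a longer editing interval.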

4.2. QUALITY BOOSTING WITH STOCHASTIC NOISE INJECTION

Although DDIM achieves nearly perfect inversion by removing stochasticity (η = 0), Karras et al. (2022) demonstrate that stochasticity improves image quality. Likewise, we inject stochastic noise in the boosting interval [t_boost, 0]. Though a longer boosting interval would achieve higher quality, boosting over an excessively long interval would modify the content. Hence, we want to determine the shortest interval that shows enough quality boosting to guarantee minimal change in the content. We consider the noise in the image as the capacity for quality boosting and define the quality deficiency at t as $\gamma_t = \mathrm{LPIPS}(x, x_t)$, indicating the amount of noise in x_t compared to the original image. We use x_t instead of P_t because we consider the actual image rather than the semantics. The Figure 3 inset depicts editing flexibility and quality deficiency. We empirically find that t_boost with $\gamma_{t_{boost}} = 1.2$ achieves quality boosting with minimal content change. We confirmed that the editing strength of the interval [t_boost, 0] is guaranteed to be less than 0.25. In Figure 3, after t_boost, LPIPS(x, x_t) sharply drops in the original generative process while LPIPS(x, P_t) changes little. Note that most of the quality degradation of the resulting images is caused by the DDIM reverse process, not by Asyrp. We use this quality boosting for all experiments except the ablation in Appendix H.

4.3. OVERALL PROCESS OF IMAGE EDITING

Using t_edit and t_boost determined by the above formulas, we modify the generative process of DDIM with

$p^{(t)}_\theta(x_{t-1} \mid x_t) = \begin{cases} \mathcal{N}\!\left(\sqrt{\alpha_{t-1}}\, P_t(\epsilon^\theta_t(x_t \mid f_t)) + D_t,\ \sigma_t^2 I\right), \ \eta = 0 & \text{if } T \ge t \ge t_{edit} \\ \mathcal{N}\!\left(\sqrt{\alpha_{t-1}}\, P_t(\epsilon^\theta_t(x_t)) + D_t,\ \sigma_t^2 I\right), \ \eta = 0 & \text{if } t_{edit} > t \ge t_{boost} \\ \mathcal{N}\!\left(\sqrt{\alpha_{t-1}}\, P_t(\epsilon^\theta_t(x_t)) + D_t,\ \sigma_t^2 I\right), \ \eta = 1 & \text{if } t_{boost} > t \end{cases}$

The visual overview and comprehensive algorithms of the entire process are in Appendix I.
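The three-phase process can be put together in one loop. This NumPy sketch uses stand-ins for the frozen noise predictor and f_t (both hypothetical) and an illustrative β schedule:

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)     # illustrative schedule, not the paper's
alphas = np.cumprod(1.0 - betas)
rng = np.random.default_rng(0)

def generate(x_T, eps_fn, delta_h_fn, t_edit, t_boost):
    """Three-phase reverse process: Asyrp editing, plain DDIM, stochastic boosting.

    eps_fn(x, t, dh) stands in for the frozen noise predictor (dh=None means
    the unmodified bottleneck); delta_h_fn(x, t) stands in for f_t."""
    x = x_T
    for t in range(T - 1, 0, -1):
        a_t, a_prev = alphas[t], alphas[t - 1]
        eps = eps_fn(x, t, None)
        if t >= t_edit:                          # phase 1: shift P_t only (Asyrp)
            eps_shift = eps_fn(x, t, delta_h_fn(x, t))
            Pt = (x - np.sqrt(1 - a_t) * eps_shift) / np.sqrt(a_t)
        else:                                    # phases 2-3: unmodified P_t
            Pt = (x - np.sqrt(1 - a_t) * eps) / np.sqrt(a_t)
        if t < t_boost:                          # phase 3: eta = 1 noise injection
            sigma = np.sqrt((1 - a_prev) / (1 - a_t) * (1 - a_t / a_prev))
            Dt = np.sqrt(1 - a_prev - sigma**2) * eps
            x = np.sqrt(a_prev) * Pt + Dt + sigma * rng.standard_normal(np.shape(x))
        else:                                    # phases 1-2: deterministic (eta = 0)
            x = np.sqrt(a_prev) * Pt + np.sqrt(1 - a_prev) * eps
    return x

# Toy predictor and direction function, purely for illustration.
x_out = generate(rng.standard_normal(8),
                 lambda x, t, dh: 0.1 * x if dh is None else 0.1 * x + dh,
                 lambda x, t: 0.05 * np.ones_like(x),
                 t_edit=800, t_boost=200)
```

Note that D_t always uses the unmodified prediction, so phase 1 is exactly the asymmetric step; phases 2 and 3 differ only in whether σ_t is zero.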

5. EXPERIMENTS

In this section, we show the effectiveness of semantic latent editing in h-space with Asyrp on various attributes, datasets, and architectures in § 5.1. Moreover, we provide quantitative results including a user study in § 5.2. Lastly, we provide detailed analyses of the properties of the semantic latent space on h-space and alternatives in § 5.3. Implementation details. We implement our method in various settings: CelebA-HQ (Karras et al., 2018) and LSUN-bedroom/-church (Yu et al., 2015) on DDPM++ (Song et al., 2020b; Meng et al., 2021); AFHQ-dog (Choi et al., 2020) on iDDPM (Nichol & Dhariwal, 2021); and METFACES (Karras et al., 2020) on ADM with P2-weighting (Dhariwal & Nichol, 2021; Choi et al., 2022). Please note that all models are official pretrained checkpoints and are kept frozen. Detailed settings, including the coefficients λ_CLIP and λ_recon and the source/target descriptions, can be found in Appendix J.1. We train f_t with S = 40 for 1 epoch using 1,000 samples. The real samples are randomly chosen from each dataset for in-domain-like attributes. For out-of-domain-like attributes, we randomly draw 1,000 latent variables x_T ∼ N(0, I). Our method even synthesizes attributes unseen in the training dataset, such as church → {department, factory, and temple}. Even for dogs, our method synthesizes smiling Poodles and Yorkshires, species that barely smile in the dataset. Figure 5 provides results for changing human faces to different identities, painting styles, and ancient primates. More results can be found in Appendix N. The versatility of our method is surprising because we do not alter the models but only shift the bottleneck feature maps in h-space with Asyrp during inference. Note that we did not train f_t for the negative direction.

5.2. QUANTITATIVE COMPARISON

Considering that our method can be combined with various diffusion models without finetuning, we do not find a comparably versatile competitor. Nonetheless, we compare Asyrp against DiffusionCLIP, which edits real images by finetuning the whole model, using its official code. We asked 80 participants to choose the images with better quality, more natural attribute change, and overall preference for a total of 40 sets of original images, ours, and DiffusionCLIP's. Table 1 shows that Asyrp outperforms DiffusionCLIP in all perspectives, including attributes unseen in the training dataset. We list the settings for fair comparison, including the questions and example images, in Appendix K.1. See § K.2 for more evaluation metrics: segmentation consistency (SC) and directional CLIP similarity (S_dir).

5.3. ANALYSIS ON h-space

We provide detailed analyses to validate the properties of the semantic latent space for diffusion models: homogeneity, linearity, robustness, and consistency across timesteps. Homogeneity. Figure 6 illustrates the homogeneity of h-space compared to ϵ-space. One ∆h_t optimized for an image results in the same attribute change on other input images. On the other hand, one ∆ϵ_t optimized for an image distorts other input images. In Figure 10, applying the mean direction ∆h_t^mean yields results almost identical to the individually optimized ∆h_t. Linearity. In Figure 7, we observe that linearly scaling a ∆h reflects the amount of change in the visual attributes. Surprisingly, it generalizes to negative scales that are not seen during training. Moreover, Figure 8 shows that combinations of different ∆h's yield their combined semantic changes in the resulting images. Appendix N.2 provides mixed interpolation between multiple attributes.


Figure 10: Consistency on h-space. Results of ∆h_t, ∆h_t^mean, and ∆h^global are almost identical for in-domain samples. However, we choose ∆h_t over the others to prevent unexpected small differences in unseen domains. We provide more results in § L.1.

Robustness. Figure 9 compares the effect of adding random noise in h-space and in ϵ-space. The random noises are chosen to be vectors with random directions and the magnitudes of the example ∆h_t and ∆ϵ_t in Figure 6 on each space. Perturbation in h-space leads to realistic images with minimal difference or some semantic changes. On the contrary, perturbation in ϵ-space severely distorts the resulting images. See Appendix D.2 for more analyses.

Consistency across timesteps. Recall that ∆h_t for all samples are homogeneous and replacing them by their mean ∆h_t^mean yields similar results. Interestingly, in Figure 10, we observe that adding a time-invariant $\Delta h^{global} = \frac{1}{T_e}\sum_t \Delta h_t^{mean}$ instead of ∆h_t also yields similar results, where T_e denotes the length of the editing interval [T, t_edit]. Though we use ∆h_t to deliver the best quality and manipulation, using ∆h_t^mean or even ∆h^global with some compromise would be worth trying for simplicity. We report more detail about the mean and global directions in Appendix L.1.

6. CONCLUSION

We proposed a new generative process, Asyrp, which facilitates image editing in a semantic latent space, h-space, for pretrained diffusion models. h-space has nice properties as in the latent space of GANs: homogeneity, linearity, robustness, and consistency across timesteps. The full editing process is designed to achieve versatile editing and high quality by measuring editing strength and quality deficiency at timesteps. We hope that our approach and detailed analyses help cultivate a new paradigm of image editing in the semantic latent space of diffusion models. Combining previous finetuning or guidance techniques would be an interesting research direction.

A RELATED WORK

Since Sohl-Dickstein et al. (2015), denoising diffusion probabilistic models (DDPMs) provide a universal approach for generative modeling (Ho et al., 2020). On the other hand, Song et al. (2020b) suggest score-based models and unify SDEs incorporating diffusion models with score-based models. Subsequent works renovate diffusion models by focusing on architectures, scheduling, weighting, and fast sampling (Nichol & Dhariwal, 2021; Karras et al., 2022; Choi et al., 2022; Song et al., 2020a; Watson et al., 2022). They mainly consider random generation rather than controlled generation. In the meantime, Dhariwal & Nichol (2021) introduce classifier guidance, not only improving the quality of images but also retrieving a specific class of images. Since it can apply any guidance, its variants have emerged (Sehwag et al., 2022; Avrahami et al., 2022; Liu et al., 2021; Nichol et al., 2021). However, it requires a noise-dependent classifier (or other off-the-shelf models) and additional cost to compute gradients for the guidance during its sampling process. Other works try to control the generative process using image-space guidance (Choi et al., 2021; Meng et al., 2021; Lugmayr et al., 2022; Avrahami et al., 2022). They manipulate resulting images by matching noisy images with target images during the reverse process.
Still, it is hard to expect delicate control of the reverse process from image guidance. Furthermore, Preechakul et al. (2022) introduces an extra encoder which encodes the semantic features of a real image in order to condition the generative process. Although the semantics allow one to control diffusion models, it requires additional training from scratch with the encoder and inherently cannot use other pretrained diffusion models. For controllability, Rombach et al. (2022) and Vahdat et al. (2021) apply another approach, which adapts a VAE (Kingma & Welling, 2013) and an autoencoder (Rumelhart et al., 1985) to diffusion models. In spite of their great success in editing, their diffusion models learn the distribution of the learned embeddings in the VAE or autoencoder, not the images. Kim & Ye (2021) proposes another strategy: fine-tuning a whole diffusion model for image editing. It shows valid performance but requires a separate fine-tuned model for each attribute. In comparison, Asyrp enables outstanding manipulation without heavy computation, specifically designed architectures, or fine-tuning whole models. GAN-based editing methods also achieve semantic manipulation in their latent spaces. However, they have to conduct 'inversion' into their latent space for real-image editing, and 'GAN inversion' is often challenging and produces unexpected appearance changes. On the contrary, Asyrp can use the latent space of real images via the nearly perfect, easy inversion of DDIM.

B MORE DISCUSSION

In this section, we discuss the pros and cons of diffusion-model-based and GAN-based methods, and we provide guidelines for further improvements. GAN-based methods require careful inversion from real images to latent codes for real-image editing. On the contrary, our proposed method based on diffusion models has a powerful advantage: a sophisticated inversion method is not necessary. This means that we can obtain the latent code of an arbitrary real image even if the image is not in the trained domain. On the other hand, several inversion methods have been proposed for GANs to obtain the latent of a real image, and the corresponding latent manipulation method should be considered for each inversion method. For example, it is difficult to apply a method of editing in w space to a method of inversion using w+ space. However, GANs have the advantage of fast sampling, while diffusion models have a relatively slow sampling time. Additionally, we have to be aware of the timesteps of diffusion models, which are still less well understood. The advantage of being free from inversion provides the following milestones. Manipulation in the latent space of diffusion models is the same as editing in real images. It can be expanded to segmentation, clustering, classification, etc. in h-space for real-world images. It would be an interesting research direction to employ previous techniques. Our method can be used in conjunction with gradient guidance methods. Although we do not focus on random sampling, ours works effectively for stochastic sampling (see § M). It may bring more diverse methods to steer diffusion models. h-space in latent diffusion models such as Stable Diffusion is another interesting research direction. The main contribution of our paper, modifying only P_t while preserving D_t, can be adapted to latent diffusion models. However, since the meaning of the latent may differ due to structural differences, research on this is needed.
Furthermore, the properties of h-space across timesteps have not been fully discussed so far; research on them can be expected to expand further. Limitations. Editing with Asyrp seldom yields changes in overall style or peripheral objects but edits attributes of the main object. Style transfer using frozen diffusion models is our future work. Societal impact / Ethics statement. Techniques for high-quality image manipulation such as Asyrp should be accompanied by social and/or technical solutions to prevent abuse. We acknowledge the potential ethical implications that may arise from the use of our image manipulation technique, Asyrp. We advocate for the development and implementation of social and technical solutions to prevent potential abuses such as spreading disinformation or propaganda. We are committed to ensuring fairness and non-discrimination, legal compliance, and research integrity in our work.

C PROOF OF THEOREM 1

Proof of Theorem 1. Let $\epsilon^\theta_t$ be a predicted noise during the original reverse process at $t$ and $\tilde{\epsilon}^\theta_t$ be its shifted counterpart. Then, $\Delta x_t = \tilde{x}_{t-1} - x_{t-1}$ is negligible, where $\tilde{x}_{t-1} = \sqrt{\alpha_{t-1}}\, P_t(\tilde{\epsilon}^\theta_t(x_t)) + D_t(\tilde{\epsilon}^\theta_t(x_t))$.

Figure 12: Illustration of Theorem 1. The upper blue line describes applying the noise $\tilde{\epsilon}_t = \epsilon_t + \Delta\epsilon_t$ to produce $P_t(\tilde{\epsilon}_t)$, the shifted predicted $x_0$. However, the shift in $P_t(\tilde{\epsilon}_t)$ due to $\Delta\epsilon_t$ is canceled out by the shift in $D_t(\tilde{\epsilon}_t)$ due to $\Delta\epsilon_t$. As a result, applying $\Delta\epsilon_t$ both on $P_t$ and $D_t$ brings outputs identical to the original.

Define $\tilde{\epsilon}^\theta_t(x_t) = \epsilon^\theta_t(x_t) + \Delta\epsilon_t$, $\{\beta_t\}_{t=1}^{T} = \{\beta_1 = \beta_{min}, \ldots, \beta_T = \beta_{max}\}$, and $\alpha_t = \prod_{s=1}^{t}(1-\beta_s)$. Note that $\beta_{max}$ is defined as a small value (e.g., $\beta_{max} = 0.001$) and $\{\beta_t\}_{t=1}^{T}$ increase from $\beta_1 = \beta_{min} \approx 0$ (e.g., $\beta_{min} = 0.00001$) to $\beta_T = \beta_{max}$. Then,

$\tilde{x}_{t-1} = \sqrt{\alpha_{t-1}}\, P_t(\tilde{\epsilon}^\theta_t(x_t)) + D_t(\tilde{\epsilon}^\theta_t(x_t))$ (10)

$= \sqrt{\alpha_{t-1}}\, \frac{x_t - \sqrt{1-\alpha_t}\,(\epsilon^\theta_t(x_t) + \Delta\epsilon_t)}{\sqrt{\alpha_t}} + \sqrt{1-\alpha_{t-1}} \cdot (\epsilon^\theta_t(x_t) + \Delta\epsilon_t)$ (11)

$= \sqrt{\alpha_{t-1}}\, P_t(\epsilon^\theta_t(x_t)) + D_t(\epsilon^\theta_t(x_t)) - \frac{\sqrt{\alpha_{t-1}}\sqrt{1-\alpha_t}}{\sqrt{\alpha_t}}\, \Delta\epsilon_t + \sqrt{1-\alpha_{t-1}} \cdot \Delta\epsilon_t$ (12)

$= x_{t-1} + \left(-\frac{\sqrt{1-\alpha_t}}{\sqrt{1-\beta_t}} + \sqrt{1-\alpha_{t-1}}\right) \Delta\epsilon_t$ (13)

$= x_{t-1} + \left(-\frac{\sqrt{1-\alpha_t}}{\sqrt{1-\beta_t}} + \frac{\sqrt{\left(1 - \prod_{s=1}^{t-1}(1-\beta_s)\right)(1-\beta_t)}}{\sqrt{1-\beta_t}}\right) \Delta\epsilon_t$ (14)

$= x_{t-1} + \frac{\sqrt{1-\alpha_t-\beta_t} - \sqrt{1-\alpha_t}}{\sqrt{1-\beta_t}}\, \Delta\epsilon_t \quad \left(\because \alpha_t = \prod_{s=1}^{t}(1-\beta_s)\right)$ (15)

$\therefore \Delta x_t = \tilde{x}_{t-1} - x_{t-1} = \frac{\sqrt{1-\alpha_t-\beta_t} - \sqrt{1-\alpha_t}}{\sqrt{1-\beta_t}}\, \Delta\epsilon_t$ is negligible $(\because \beta_t \le \beta_{max})$ (16)

D ADDITIONAL SUPPORTS FOR h-space WITH ASYRP

D.1 RANDOM PERTURBATION ON ϵ-space WITHOUT ASYRP

In § 3.1, we argue that if both P_t and D_t are shifted, we cannot manipulate x_0. In Figure 13, we do not observe a noticeable difference between (a) and (b), which are the results of the original reverse process of DDIM and the one with both terms shifted, respectively.

D.2 ROBUSTNESS AND SEMANTICS IN h-space AND ϵ-space WITH ASYRP

In § 3.2, we also argue that h-space is more robust than ϵ-space with Asyrp. In Figure 13, we observe that small random noise $z \sim \mathcal{N}(0, I)$ in ϵ-space degrades the resulting image without semantic changes (c), while much larger random noise in h-space yields random semantic changes without severe artifacts (d). Note that diffusion models are designed as latent variable models with learned Gaussian transitions, and the reverse process should also be close to Gaussian. Based on this assumption, we manage $\epsilon^\theta_t$ to follow the original Gaussian distribution as follows. Adding $z \sim \mathcal{N}(0, \sigma I)$ to $\epsilon^\theta_t$ expands the distribution of the predicted noise and may produce distorted images. To preserve the distribution, we scale $\tilde{\epsilon}^\theta_t = (\epsilon^\theta_t + z)/\sqrt{1 + \sigma^2}$. Still, the resulting images are almost identical to those with $\tilde{\epsilon}^\theta_t = \epsilon^\theta_t + z$ as shown in Figure 13. It is no wonder that the scaling does not improve the distorted results, since the additive random noise disturbs the denoising operation of the predicted noise.

D.3 CHOICE OF h-space IN U-NET

As shown in Figure 14 , there are many other candidates for h-space in the architecture. Among the layers, we choose the 8th layer, the bridge of the U-Net based architecture. The layer is not influenced by any skip connection, has the smallest spatial dimension with compressed information, and is located just before the upsampling blocks. Thus, we assume that it could possibly be considered as the most suitable latent embedding. To confirm the assumption, we train f t on the other layers. The results are shown in Figure 15 . We carefully tuned the training hyperparameters (λ CLIP and λ recon ) for fair comparison. The 1st to the 6th layers hardly bring visible changes. The 7th and 9th layers bring not only the desired changes but also difficulty in finding optimal hyperparameters. After the 9th layer, the results bear severe artifacts.

E IMPLICIT NEURAL DIRECTIONS

Figure 16 illustrates the neural implicit function f_t. It has only two 1×1 convolution layers with 512 channels. Note that we have not explored the network architecture much.
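A 1×1 convolution is just a per-position linear map over channels, so f_t can be sketched in a few lines of NumPy. This is our own minimal sketch: per Figure 16, the paper's f_t also uses group norm and a timestep embedding, which we omit here for brevity, and the weight initialization is purely illustrative.

```python
import numpy as np

def conv1x1(x, w, b):
    """1x1 convolution: a per-pixel linear map over channels.
    x: (C_in, H, W), w: (C_out, C_in), b: (C_out,)."""
    return np.einsum("oc,chw->ohw", w, x) + b[:, None, None]

def swish(x):
    return x / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
C, H, W = 512, 8, 8  # bottleneck shape of h-space
w1, b1 = 0.01 * rng.standard_normal((C, C)), np.zeros(C)
w2, b2 = 0.01 * rng.standard_normal((C, C)), np.zeros(C)

def f_t(h):
    """Two 1x1 conv layers with 512 channels, predicting Delta-h."""
    return conv1x1(swish(conv1x1(h, w1, b1)), w2, b2)

h = rng.standard_normal((C, H, W))  # feature map from the U-Net bottleneck
delta_h = f_t(h)
print(delta_h.shape)  # same shape as h
```

Because the output shape matches h, the predicted ∆h can be added directly to the bottleneck feature map during the reverse process.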

[Figure 14 legend: feature maps from 256²×128 down through 128²×128, 64²×256, 32²×256, and 16²×512 to the 8²×512 bottleneck, then back up to 256²×3; blocks are Res blocks, Res blocks + Att blocks, Conv3×3, concatenation skip connections, and a sinusoidal positional embedding of t.]

Figure 14 : Location of h-space. The U-Net architecture of diffusion models outputs 256 × 256 images. Each layer is indexed with a number along the operating sequence of the model. The 8th layer is our h-space which is not directly influenced by a skip connection.

F QUALITY IMPROVEMENTS BY NON-ACCELERATED SAMPLING WITH SCALED ∆h t

Figure 17 shows the quality improvements by non-accelerated sampling with scaled ∆h_t described in § 3.4. Even with different numbers of inference steps, we observe similar changes of an attribute if we preserve the sum of ∆h_t. This scaling technique allows non-accelerated sampling with 1000 steps for models trained by accelerated training with 40 steps. Using non-accelerated sampling with scaled ∆h_t leads to the same magnitude of manipulation and higher-quality images. In our experiments, sampling takes about 1.5 seconds for 40 steps and 40 seconds for 1000 steps.
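Preserving the sum of ∆h_t when changing the number of inference steps amounts to a simple rescaling. A minimal sketch (ours; the constant per-step ∆h_t is an illustrative assumption) with 40 training steps and 1000 sampling steps:

```python
import numpy as np

train_steps, sample_steps = 40, 1000

# Illustrative per-step edits learned with accelerated (40-step) training.
delta_h_train = np.full(train_steps, 0.2)
total_edit = delta_h_train.sum()  # the quantity we want to preserve

# Scale each Delta-h_t by train_steps / sample_steps so that the sum over
# the 1000-step non-accelerated schedule matches the 40-step sum.
scale = train_steps / sample_steps
delta_h_sample = np.full(sample_steps, 0.2) * scale

print(total_edit, delta_h_sample.sum())  # sums match
```

With the sums equal, the 1000-step schedule applies the same total edit while each individual step perturbs the trajectory far less, which is consistent with the quality gains reported above.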

G EDITING STRENGTH AND EDITING FLEXIBILITY

G.1 EDITING STRENGTH AND t edit

Figure 18 shows the results according to t edit. If t edit is too high, the editing interval becomes too short, resulting in insufficient changes. On the contrary, too low a t edit causes excessive, unnecessary manipulation from the long editing interval. We observe that t edit is one of the important hyperparameters. We argue that the formula for choosing t edit using editing strength is reasonable because it applies to all five datasets despite its sensitivity, even though the choice of LPIPS = 0.33 is empirical. We provide an ablation of the threshold in Figure 22. Additionally, these results imply why we need a sufficiently low t edit in the unseen domains: an editing interval with insufficient editing strength struggles to escape from the training domain.
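Choosing t edit by the editing-strength criterion reduces to finding the smallest t whose editing strength LPIPS(x, P_t) reaches the empirical threshold 0.33. A hedged sketch of that search: the monotone `editing_strength` stand-in below replaces a real LPIPS evaluation (which would run the diffusion model to obtain P_t and a pretrained LPIPS network) and is purely illustrative.

```python
T = 1000

def editing_strength(t):
    """Stand-in for LPIPS(x, P_t), which increases monotonically in t.
    In practice this is measured with the model and an LPIPS network;
    here we fake it with a smooth curve for illustration."""
    return 0.8 * (t / T) ** 0.7

def find_t_edit(threshold=0.33):
    # Smallest t whose editing strength reaches the threshold;
    # the interval [t_edit, T] then carries enough editing strength.
    for t in range(1, T + 1):
        if editing_strength(t) >= threshold:
            return t
    return T

t_edit = find_t_edit()
print(t_edit, editing_strength(t_edit))
```

Since the curve is monotone, a binary search would also work; the linear scan just keeps the sketch short.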

G.2 EDITING FLEXIBILITY AND t boost

Table 2 shows the prompts, t edit, and t boost. Intuitively, the attributes with larger visual changes have smaller cosine similarity and require a longer editing interval. Figure 19 shows the average LPIPS(x, P_t) and LPIPS(x, x_t) of 100 samples on all datasets. Note that t edit and t boost differ across datasets.

Figure 18: Importance of choosing proper t edit. We explore various t edit with smiling. Too short an editing interval struggles to manipulate attributes. Excessive editing strength results in degraded images.

H QUALITY BOOSTING

Figure 20 shows the effect of quality boosting with the original DDIM reverse process. Although the reverse process of DDIM has a nearly perfect inversion property, we observe some noise by zooming in. Our quality boosting improves the quality of a sample while keeping the nearly perfect inversion property. We observe that t boost is not sensitive, but a larger interval brings less preservation. Figure 21 shows quality improvements by our quality boosting in Asyrp.

I ALGORITHM

// phase 1: editing
if s > S_edit then
    ∆h_τs = Σ_{i=1}^M c_i f^i_τs(h_τs)
    ε̃ = ϵ_θ(x_τs | ∆h_τs);  ϵ = ϵ_θ(x_τs);  σ_τs = 0
// phase 2: denoising
else if s ≥ S_noise then
    ε̃ = ϵ = ϵ_θ(x_τs);  σ_τs = 0
// phase 3: quality boosting
else
    ε̃ = ϵ = ϵ_θ(x_τs)
    σ_τs = √((1 − α_{τ_{s−1}})/(1 − α_τs)) · √(1 − α_τs/α_{τ_{s−1}})
    z ∼ N(0, I)
x̃_{τ_{s−1}} = √α_{τ_{s−1}} · (x̃_τs − √(1 − α_τs) ε̃)/√α_τs + √(1 − α_{τ_{s−1}} − σ²_τs) · ϵ + σ_τs z
return x̃_0 (manipulated image)

As for LSUN-church, we provide a set of four images at once and add a question: 3) Diversity: Which group do you think has a more diverse style? 4) Overall: Which image do you think is better considering the above evaluation criteria?
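The three-phase generative process in the algorithm above can be sketched end-to-end. This is our own simplified NumPy mock-up: images are flattened to vectors, and the frozen network ϵ_θ and the implicit functions f_t are stubs, so it demonstrates only the phase logic and the update rule, not a real diffusion model.

```python
import numpy as np

rng = np.random.default_rng(0)
S_gen, S_edit, S_noise, D = 40, 28, 4, 16  # illustrative schedule indices
betas = np.linspace(1e-4, 0.02, S_gen)
alphas = np.cumprod(1.0 - betas)           # alpha-bar over the tau schedule

def eps_theta(x, delta_h=None):
    """Stub for the frozen noise predictor; delta_h shifts the prediction."""
    e = 0.1 * x
    return e + delta_h if delta_h is not None else e

def f_t(x):
    return 0.01 * x  # stub implicit function (stands in for f(h))

x = rng.standard_normal(D)  # x_T from inversion (or random noise)
for s in range(S_gen - 1, 0, -1):
    a_t, a_prev = alphas[s], alphas[s - 1]
    if s > S_edit:                 # phase 1: editing
        eps_tilde, eps, sigma = eps_theta(x, f_t(x)), eps_theta(x), 0.0
    elif s >= S_noise:             # phase 2: denoising
        eps_tilde = eps = eps_theta(x); sigma = 0.0
    else:                          # phase 3: quality boosting
        eps_tilde = eps = eps_theta(x)
        sigma = np.sqrt((1 - a_prev) / (1 - a_t)) * np.sqrt(1 - a_t / a_prev)
    z = rng.standard_normal(D)
    pred_x0 = (x - np.sqrt(1 - a_t) * eps_tilde) / np.sqrt(a_t)  # P_t (edited)
    x = np.sqrt(a_prev) * pred_x0 + np.sqrt(1 - a_prev - sigma**2) * eps + sigma * z

print(x.shape)  # manipulated sample x_0
```

Note the asymmetry that defines Asyrp: P_t is computed from the shifted prediction ε̃ while D_t keeps the original ϵ, and stochastic noise (σ > 0) is injected only during the quality-boosting phase.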

K.2 SEGMENTATION CONSISTENCY AND DIRECTIONAL CLIP SIMILARITY

We compare Asyrp and DiffusionCLIP using directional CLIP similarity (S_dir) and segmentation consistency (SC), following the protocols in DiffusionCLIP, in Table 4 and Table 5. A pretrained CLIP (Radford et al., 2021) and segmentation models (Yu et al., 2018; Zhou et al., 2019; 2017; Lee et al., 2020) are used to compute S_dir and SC, respectively. We choose three attributes (smiling, sad, tanned) for CelebA-HQ in-domain, two attributes (Pixar, Neanderthal) for CelebA-HQ unseen-domain, and three attributes (department store, ancient, red brick) for LSUN-church. For a fair comparison, we use official checkpoints of DiffusionCLIP and provide scores for the attributes (tanned, red brick) following Kim & Ye (2021). Regarding attributes without official checkpoints (smiling, sad), we train DiffusionCLIP ourselves with the official code. The core training step (cf. Algorithm 1) is:

Extract feature map h_τs from ϵ_θ(x^(i)_τs)
∆h_τs = f_τs(h_τs)
P = (x^(i)_τs − √(1 − α_τs) ϵ_θ(x^(i)_τs | ∆h_τs)) / √α_τs;  P_src = (x^(i)_τs − √(1 − α_τs) ϵ_θ(x^(i)_τs)) / √α_τs
x̃^(i)_{τs−1} = √α_{τs−1} P + √(1 − α_{τs−1}) ϵ_θ(x^(i)_τs);  x^(i)_{τs−1} = √α_{τs−1} P_src + √(1 − α_{τs−1}) ϵ_θ(x^(i)_τs)
L_total ← λ_CLIP L_direction(P, y_tar; P_src, y_src) + λ_recon |P − P_src|
Take a gradient step on L_total and update f_t

We use 100 samples per attribute. Asyrp outperforms DiffusionCLIP on S_dir for all attributes. On SC, DiffusionCLIP achieves better or competitive scores. Because DiffusionCLIP manipulates images mostly by focusing on texture or color while preserving structure and shape, it has an advantage in SC scores. However, in Figure 28, results of smiling show that it is not suitable for editing attributes which require structural manipulation. Note that better SC scores do not guarantee better qualitative performance. Results of Pixar show a similar tendency for each method. We allow more structural changes than DiffusionCLIP while editing. The lower SC of our method comes from desirable structural changes, as shown in Figure 28.

Figure 29 and Figure 30 show that the effects of the mean direction and the global direction are quite similar to ∆h_t by f_t for various attributes. We compute the mean direction and the global direction from 20 different images. We argue that h-space is roughly homogeneous across samples and timesteps. However, we observe that h-space is not completely independent of the conditions, especially on unseen domains.

Table 3: The coefficients range from 0.5 to 0.8 for in-domain attributes. Unseen domains need slightly stronger λ_CLIP. We also report which attributes we train with random noise sampling; the criterion is which attributes require relatively less identity preservation. We use λ_recon = CLIP similarity × 3, which reduces the L1 loss when an attribute needs a lot of change.

N MORE SAMPLES

N.1 IMAGENET

We conduct extra experiments: editing images toward a target class using Asyrp with an ImageNet-pretrained model. We verified that models trained on large datasets, such as ImageNet, can be edited using Asyrp. However, we also observed that in this case the latent space is not partitioned by classes. For orange, we have different latents for a single orange, for many oranges, for a cross-section of a cut orange, and for a single piece of orange. Therefore, we learn the implicit function on a collection of similar images to find the direction.

N.2 MULTI-INTERPOLATION

Figure 33 provides mixed interpolation between multiple attributes. We observe that any interpolation with any attribute is possible.

N.3 MORE RESULTS ON ALL DATASETS

We provide more results on CelebA-HQ (Figure 34), LSUN-church (Figure 35), and AFHQ-dog, LSUN-bedroom, and METFACES (Figure 36).



Figure 1: Manipulation approaches for diffusion models. (a) Image guidance suffers ambiguity while controlling the generative process. (b) Classifier guidance requires an extra classifier, is hardly editable, degrades quality, or alters the content. (c) DiffusionCLIP requires fine-tuning the whole model. (d) Our method discovers a semantic latent space of a frozen diffusion model.

Figure 2: Generative process of Asyrp. The green box on the left illustrates Asyrp which only alters P t while preserving D t shared by DDIM. The right describes that Asyrp modifies the original reverse process toward the target attribute reflecting the change in h-space.

Figure 4: Editing results of Asyrp on various datasets. We conduct experiments on CelebA-HQ, LSUN-church, METFACES, AFHQ-dog, and LSUN-bedroom. (Shown: real image and the attributes Nicolas Cage, Neanderthal, Modigliani, Pixar, Frida.)

Figure 6: Optimization for smiling on h-space and ϵ-space. (a) Optimizing ∆h t for a sample with smiling results in natural editing while the change due to optimizing ∆ϵ t is relatively small. (b) Applying ∆h t from (a) to other samples yields the same attribute change while ∆ϵ t distorts the images. The result of ∆ϵ t is the best sample we could find.

Figure 7: Linearity of h-space. Linear interpolation and extrapolation on h-space lead to gradual changes even at unseen intensities and directions. The right side shows interpolation results by positive scaling of ∆h^smiling_t. The left side shows extrapolation results by negative scaling of ∆h^smiling_t.

identical results where i indicates indices of N = 20 random samples.

Figure 8: Linear combination. Combining multiple ∆hs leads to combined attribute changes.

Figure 11: Editing dog to smile. We provide high-resolution results of editing real images in the teaser and more.

Meanwhile, generative adversarial networks (Goodfellow et al., 2020) address their latent space for image editing (Ling et al., 2021; Härkönen et al., 2020; Chefer et al., 2021; Shen et al., 2020; Yüksel et al., 2021; Patashnik et al., 2021; Gal et al., 2021; Dai et al., 2019; Xu et al., ...

GAN-based latent manipulation methods (Patashnik et al., 2021; Gal et al., ...

Figure 13: (a) The reconstructed image from the original DDIM inversion process, which is almost indistinguishable from the real image. (b) The result of adding random noise z to both P_t and D_t; (a) looks identical to (b). The insets of (b, c) depict the SSIM map with respect to (a). This shows that simply shifting ϵ^θ_t without Asyrp does not affect the result. (c) Adding z ∼ N(0, I) to ϵ-space with Asyrp easily degrades the image with little semantic change. (d) Adding z ∼ N(0, I) to h-space with Asyrp yields random semantic changes without image degradation. (e) Correlation between image degradation and noise strength in the two spaces.

Figure 15: Exhaustive enumeration over the choices for the semantic latent space. We show results on the training data. We observe that the 8th layer (h-space) of the U-Net suits best as the semantic latent space.

Figure 16: Illustration of f t . We use group norm and swish following DDPM.

Figure 17: Quality improvements by non-accelerated sampling with scaled ∆h t . We observe quality improvements by non-accelerated (1000-step) sampling with scaled ∆h t from accelerated (40-step) training described in § 3.4. Please zoom in for detailed comparison.

Figure 19: Average LPIPS(x, P t ) and LPIPS(x, x t ) of 100 samples on all datasets.

Figure 20: Ablation study of quality boosting on the original DDIM process without Asyrp. Our quality boosting enhances fine details and prevents images from being noisy in the original DDIM process. Please zoom in for detailed comparison.

Figure 21: Ablation study of quality boosting with Asyrp. Our quality boosting enhances fine details and prevents images from being noisy. Note that the source of degradation is DDIM process, not Asyrp, confirmed in Figure 20. Please zoom in for detailed comparison.

Figure 22: Analyzing the effect of the hyperparameters. The figure shows that the calculated parameter works effectively as the maximum boundary of editing while maintaining the quality of the image.

Figure 24 illustrates the generative process. Algorithms 1 and 2 describe the training and inference procedures of Asyrp, respectively.

Figure 29 and Figure 30 show that the effects of the mean direction and the global direction are quite similar to ∆h_t by f_t for various attributes. We compute the mean direction and the global direction from 20 different images.

Figure 32: Result of Asyrp on ImageNet. The result shows that Asyrp works even on the ImageNet dataset.

Figure 33: Combination of multiple attributes. The result shows that Asyrp works for the combined ∆h of smiling and young.

Details are described in Appendix J.2. Training takes about 20 minutes with three RTX 3090 GPUs. None of the images in the figures are used for training. We set S = 1,000 for inference. The code is available at https://kwonminki.github.io/Asyrp/

User study with 80 participants. The details are described in § K.1.

Prompts, CLIP similarity, t edit , and t boost for all attributes in the experiments.

Input: x_0 (input image), {f^i_t}_{i=1}^M (M neural implicit functions for M attributes), {c_i}_{i=1}^M (M user-defined scaling coefficients for M attributes), ϵ_θ (frozen pretrained model), S_for (# of inversion steps), S_gen (# of inference steps), t_edit (computed from § 4.1), t_boost (computed from § 4.2)
Function Editor(x_0, ϵ_θ, {c_i}_{i=1}^M, S_for, S_gen, *):
    Define {τ_s}_{s=1}^{S_gen} s.t. τ_1 = 0, τ_{S_edit} = t_edit, τ_{S_noise} = t_boost, and τ_{S_gen} = T

DiffusionCLIP: 0.813 / 91.41%, 0.760 / 89.93%, 0.888 / 92.85%, 0.811 / 89.91%, 0.661 / 81.23% (S_dir / SC per attribute).

Quantitative evaluation on CelebA-HQ.

ACKNOWLEDGMENTS

This work was supported by the National Research Foundation of Korea (NRF) grant funded by the Korea government (Ministry of Science and ICT) (No. 2021-0-00155) 

Availability: https://kwonminki.github.io/Asyrp/

J.2 TRAINING WITH RANDOM SAMPLING INSTEAD OF THE TRAINING DATASETS

For training, inverting real images can be replaced by random sampling. This refers to using x_T ∼ N(0, I) instead of x_T obtained through the forward process ∏_{t=1}^T q(x_t | x_{t−1}) from x_0 ∼ p_data(x). It allows us to train Asyrp with only the pretrained network and without an extra dataset. Using random samples trades off preservation of content against the possible amount of editing. It is advantageous when a target attribute requires large changes. We assume that the inversion of a real image lies in the long tail of the Gaussian distribution because of its realistic background or detailed clothes. On the other hand, random noise is closer to the mean of the Gaussian, so it is easier to find directions: it can easily bring larger changes but can also easily alter the content. On the contrary, training with inversion shows the opposite property. We train f_t with random sampling for attributes whose identity preservation is not important, to take advantage of these properties. The rightmost column in Table 3 shows the choices.

K EVALUATION

K.1 USER STUDY

We conduct a user study to compare the performance of Asyrp and DiffusionCLIP (Kim & Ye, 2021) on CelebA-HQ (Liu et al., 2015) and LSUN-church (Yu et al., 2015). We use official checkpoints provided by DiffusionCLIP except for some facial attributes whose checkpoints do not exist. We tried our best to tune their hyperparameters following the manual for a fair comparison. Example images are shown in Figures 25-27. We use smiling and sad for in-domain CelebA-HQ attributes, Pixar and Neanderthal for unseen-domain CelebA-HQ attributes, and department store, ancient, and wooden for LSUN-church. For the unseen domain and LSUN-church, we use official checkpoints provided by DiffusionCLIP. We randomly select 8 images for each CelebA-HQ attribute and 12 images for each LSUN-church attribute. We observe that DiffusionCLIP works better in changing the holistic style of images. At the same time, it falls short in bringing semantic changes and suffers noisy results and a lack of diversity. These problems would be caused by fine-tuning the whole diffusion model. We use the following questions for the survey. 1) Quality: Which image quality do you think is better? (clear and less noisy) 2) Attribute: Which image do you think is "Attribute (e.g., smiling) naturally"? 3) Overall: Which image do you think is better considering the above evaluation criteria?

(See Figure 29.) Note that unseen domains require a longer editing interval with small t edit. Therefore, we conjecture that the consistency of h-space decreases at the end of the generative process. This is supported by additional experiments showing that the L2 distance between ∆h_t and the global direction gradually increases along the timesteps. We leave a more detailed analysis of h-space at different timesteps as future work.

L.2 COMPARE THREE METHODS

In this section, we compare three methods: the implicit neural direction f_t, optimized ∆h_t, and optimized ∆h_global.

Training time: f_t ≈ ∆h_global < ∆h_t. We have to optimize each ∆h_t for each timestep t. Additionally, each ∆h_t needs specific hyperparameters, e.g., higher learning rates for larger t. On the contrary, the time consumed for f_t is similar to optimizing ∆h_global.

Quality: f_t ≈ ∆h_t > ∆h_global. f_t and ∆h_t, where directions can be obtained for each timestep, have the best quality. As shown in Figure 10, ∆h_global is sometimes accompanied by slight differences in hair, etc.

Moreover, ∆h_t can be obtained from f_t, and ∆h_global can be obtained by aggregating ∆h_t. We opt to use f_t for the above three advantages.
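The aggregation relations above are simple averages over arrays of directions. A minimal sketch (ours; the shapes, the N = 20 samples, and mean aggregation are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
T, C = 40, 512  # timesteps and channel dimension (illustrative)

# Per-timestep directions Delta-h_t, e.g., produced by f_t at each step.
delta_h_t = rng.standard_normal((T, C))

# Global direction: aggregate Delta-h_t over timesteps into one direction.
delta_h_global = delta_h_t.mean(axis=0)

# Mean direction: average per-sample directions (here N hypothetical
# samples) at each timestep, exploiting the homogeneity of h-space.
N = 20
per_sample_dirs = rng.standard_normal((N, T, C))
delta_h_mean = per_sample_dirs.mean(axis=0)

print(delta_h_global.shape, delta_h_mean.shape)
```

A single global vector is the cheapest representation, which matches the training-time ranking above, while keeping one direction per timestep (as f_t does implicitly) matches the quality ranking.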

M RANDOM SAMPLING

We conduct extra experiments: generating images with target attributes using Asyrp not from inversion but from random Gaussian noise. Consequently, the generative process can be used for conditional random sampling. We provide the results in Figure 31. Although this is beyond the scope of this paper and we do not focus on these results, they show that Asyrp can be used for conditional sampling.

