DIFFUSION MODELS ALREADY HAVE A SEMANTIC LATENT SPACE

Abstract

Diffusion models achieve outstanding generative performance in various domains. Despite their great success, they lack a semantic latent space, which is essential for controlling the generative process. To address this problem, we propose asymmetric reverse process (Asyrp), which discovers the semantic latent space in frozen pretrained diffusion models. Our semantic latent space, named h-space, has nice properties that accommodate semantic image manipulation: homogeneity, linearity, robustness, and consistency across timesteps. In addition, we introduce a principled design of the generative process for versatile editing and quality boosting via quantifiable measures: editing strength of an interval and quality deficiency at a timestep. Our method is applicable to various architectures (DDPM++, iDDPM, and ADM) and datasets (CelebA-HQ, AFHQ-dog, LSUN-church, LSUN-bedroom, and METFACES). Project page: https://kwonminki.github.io/Asyrp/

1. INTRODUCTION

In image synthesis, diffusion models have advanced to achieve state-of-the-art performance in quality and mode coverage since the introduction of denoising diffusion probabilistic models (Ho et al., 2020). They disrupt images by adding noise through multiple steps of the forward process and generate samples by progressive denoising through multiple steps of the reverse (i.e., generative) process. Since their deterministic version provides nearly perfect reconstruction of original images (Song et al., 2020a), they are suitable for image editing, which renders target attributes on real images. However, simply editing the latent variables (i.e., intermediate noisy images) causes degraded results (Kim & Ye, 2021). Instead, existing methods require complicated procedures: providing guidance in the reverse process or finetuning the model for each attribute. Figure 1(a-c) briefly illustrates the existing approaches. Image guidance mixes the latent variables of a guiding image with the unconditional latent variables (Choi et al., 2021; Lugmayr et al., 2022; Meng et al., 2021). Though it provides some control, it is ambiguous which attributes it will reflect among those in the guide and the unconditional result, and it lacks intuitive control over the magnitude of change. Classifier guidance manipulates images by imposing gradients of a classifier on the latent variables in the reverse process to match a target class (Dhariwal & Nichol, 2021; Avrahami et al., 2022; Liu et al., 2021). It requires training an extra classifier on the latent variables, i.e., noisy images. Furthermore, computing gradients through the classifier during sampling is costly. Finetuning the whole model can steer the resulting images toward a target attribute without the above problems (Kim & Ye, 2021). Still, it requires multiple models to reflect multiple descriptions.
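As a concrete illustration of the classifier-guidance update described above, here is a minimal sketch. The linear classifier (logits = W @ x), the `scale` parameter, and the tensor shapes are toy assumptions chosen so the gradient has a closed form; the cited method trains a neural classifier on noisy images and follows the update mu' = mu + s * Sigma * grad_x log p(y | x_t) from Dhariwal & Nichol (2021).

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over a logit vector."""
    e = np.exp(z - z.max())
    return e / e.sum()

def guided_mean(mean, variance, x_t, y, W, scale=1.0):
    """Shift the reverse-step mean toward class y (toy sketch).

    mean, variance: moments of p_theta(x_{t-1} | x_t) from the diffusion model.
    W: weight matrix of a stand-in linear classifier on the noisy image x_t.
    """
    p = softmax(W @ x_t)
    # Analytic gradient of log softmax(W @ x)[y] w.r.t. x: W[y] - sum_c p_c W[c]
    grad = W[y] - p @ W
    return mean + scale * variance * grad
```

Note the extra cost mentioned in the text: with a neural classifier, this gradient requires a backward pass through the classifier at every sampling step.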
On the other hand, generative adversarial networks (Goodfellow et al., 2020) inherently provide straightforward image editing in their latent space. Given a latent vector for an original image, we can find the direction in the latent space that maximizes the similarity of the resulting image with a target description in the CLIP embedding space (Patashnik et al., 2021). The latent direction found on one image leads to the same manipulation on other images. However, given a real image, finding its exact latent vector is often challenging and produces unexpected appearance changes. If diffusion models, with their nearly perfect inversion property, had such a semantic latent space, they would enable equally powerful image editing. Preechakul et al. (2022) introduce an additional input to the reverse diffusion process: a latent vector from the original image embedded by an extra encoder. This latent vector contains the semantics that condition the process. However, it requires training from scratch and does not work with pretrained diffusion models. In this paper, we propose an asymmetric reverse process (Asyrp) which discovers the semantic latent space of a frozen diffusion model such that modifications in the space edit attributes of the original images. Our semantic latent space, named h-space, has the properties necessary for editing applications: the same shift in this space results in the same attribute change in all images; linear changes in this space lead to linear changes in attributes; the changes do not degrade the quality of the resulting images; and the changes are nearly identical across timesteps for a desired attribute change. Figure 1(d) illustrates some of these properties and § 5.3 provides detailed analyses. To the best of our knowledge, this is the first attempt to discover a semantic latent space in frozen pretrained diffusion models.
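The direction-finding procedure described above can be sketched with toy stand-ins. Here `G` (generator), `E` (image encoder, standing in for CLIP), and the random hill climb are illustrative assumptions to keep the sketch dependency-free; the cited work optimizes with gradients against a real CLIP model.

```python
import numpy as np

rng = np.random.default_rng(0)
G = rng.standard_normal((8, 4))   # stand-in generator: image = G @ w
E = rng.standard_normal((4, 8))   # stand-in encoder: embedding = E @ image

def similarity(w, target_emb):
    """Cosine similarity between the generated image's embedding and a target."""
    emb = E @ (G @ w)
    return emb @ target_emb / (np.linalg.norm(emb) * np.linalg.norm(target_emb))

def find_direction(w, target_emb, steps=300, step_size=0.05):
    """Hill-climb a latent direction d that raises similarity(w + d, target)."""
    d = np.zeros_like(w)
    for _ in range(steps):
        cand = d + step_size * rng.standard_normal(w.shape)
        if similarity(w + cand, target_emb) > similarity(w + d, target_emb):
            d = cand
    return d
```

The property the text highlights is that `d`, found once on one latent, can then be added to other latents to apply the same manipulation without re-running the search.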
Spoiler alert: our semantic latent space differs from the intermediate latent variables in the diffusion process. Moreover, we introduce a principled design of the generative process for versatile editing and quality boosting via quantifiable measures: the editing strength of an interval and the quality deficiency at a timestep. Extensive experiments demonstrate that our method is generally applicable to various architectures (DDPM++, iDDPM, and ADM) and datasets (CelebA-HQ, AFHQ-dog, LSUN-church, LSUN-bedroom, and METFACES).

2. BACKGROUND

We briefly describe the essential background. The rest of the related work is deferred to Appendix A.

2.1. DENOISING DIFFUSION PROBABILISTIC MODELS (DDPM)

DDPM is a latent variable model that learns a data distribution by denoising noisy images (Ho et al., 2020). The forward process diffuses the data samples through Gaussian transitions parameterized with a Markov process:

$$q(x_t \mid x_{t-1}) = \mathcal{N}\!\left(x_t;\, \sqrt{1-\beta_t}\, x_{t-1},\, \beta_t I\right) = \mathcal{N}\!\left(\sqrt{\tfrac{\alpha_t}{\alpha_{t-1}}}\, x_{t-1},\, \left(1 - \tfrac{\alpha_t}{\alpha_{t-1}}\right) I\right),$$

where $\{\beta_t\}_{t=1}^{T}$ is the variance schedule and $\alpha_t = \prod_{s=1}^{t} (1-\beta_s)$. Then the reverse process becomes $p_\theta(x_{0:T}) := p(x_T) \prod_{t=1}^{T} p_\theta(x_{t-1} \mid x_t)$, starting from $x_T \sim \mathcal{N}(0, I)$, with noise predictor $\epsilon_t^\theta$:

$$x_{t-1} = \frac{1}{\sqrt{1-\beta_t}}\left(x_t - \frac{\beta_t}{\sqrt{1-\alpha_t}}\, \epsilon_t^\theta(x_t)\right) + \sigma_t z_t,$$

where $z_t \sim \mathcal{N}(0, I)$ and $\sigma_t^2$ is the variance of the reverse process, set to $\sigma_t^2 = \beta_t$ in DDPM.
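The forward and reverse steps above can be sketched as follows. The linear beta schedule, the schedule endpoints, and the externally supplied noise prediction are illustrative assumptions; a real model would use a trained U-Net as $\epsilon_t^\theta$.

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)     # variance schedule {beta_t}
alphas = np.cumprod(1.0 - betas)       # alpha_t = prod_{s<=t} (1 - beta_s)

def q_sample(x0, t, rng):
    """Forward process: sample x_t ~ q(x_t | x_0) in its closed form,
    x_t = sqrt(alpha_t) x_0 + sqrt(1 - alpha_t) * noise."""
    noise = rng.standard_normal(x0.shape)
    return np.sqrt(alphas[t]) * x0 + np.sqrt(1.0 - alphas[t]) * noise

def p_sample(x_t, t, eps_pred, rng):
    """One reverse step x_t -> x_{t-1}, given the predicted noise eps_pred,
    following the reverse-process equation above."""
    mean = (x_t - betas[t] / np.sqrt(1.0 - alphas[t]) * eps_pred) / np.sqrt(1.0 - betas[t])
    if t == 0:
        return mean                    # no noise is added at the final step
    sigma_t = np.sqrt(betas[t])        # DDPM choice: sigma_t^2 = beta_t
    return mean + sigma_t * rng.standard_normal(x_t.shape)
```

Generation runs `p_sample` from $t = T{-}1$ down to $0$ starting at $x_T \sim \mathcal{N}(0, I)$, querying the noise predictor at each step.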



Figure 1: Manipulation approaches for diffusion models. (a) Image guidance suffers ambiguity while controlling the generative process. (b) Classifier guidance requires an extra classifier, is hardly editable, degrades quality, or alters the content. (c) DiffusionCLIP requires fine-tuning the whole model. (d) Our method discovers a semantic latent space of a frozen diffusion model.


