DIFFUSION MODELS ALREADY HAVE A SEMANTIC LATENT SPACE

Abstract

Diffusion models achieve outstanding generative performance in various domains. Despite their great success, they lack a semantic latent space, which is essential for controlling the generative process. To address this problem, we propose the asymmetric reverse process (Asyrp), which discovers a semantic latent space in frozen pretrained diffusion models. Our semantic latent space, named h-space, has nice properties for semantic image manipulation: homogeneity, linearity, robustness, and consistency across timesteps. In addition, we introduce a principled design of the generative process for versatile editing and quality boosting via quantifiable measures: the editing strength of an interval and the quality deficiency at a timestep. Our method is applicable to various architectures (DDPM++, iDDPM, and ADM) and datasets (CelebA-HQ, AFHQ-dog, LSUN-church, LSUN-bedroom, and METFACES). Project page: https://kwonminki.github.io/Asyrp/

1. INTRODUCTION

In image synthesis, diffusion models have advanced to achieve state-of-the-art performance in quality and mode coverage since the introduction of denoising diffusion probabilistic models (Ho et al., 2020). They disrupt images by adding noise through multiple steps of the forward process and generate samples by progressive denoising through multiple steps of the reverse (i.e., generative) process. Since their deterministic version provides nearly perfect reconstruction of original images (Song et al., 2020a), they are well suited for image editing, which renders target attributes on real images. However, simply editing the latent variables (i.e., intermediate noisy images) yields degraded results (Kim & Ye, 2021). Instead, existing methods require complicated procedures: providing guidance in the reverse process or finetuning the model for each attribute. Figure 1(a-c) briefly illustrates the existing approaches. Image guidance mixes the latent variables of a guiding image with the unconditional latent variables (Choi et al., 2021; Lugmayr et al., 2022; Meng et al., 2021). Though it provides some control, it is ambiguous which attributes will be reflected from the guide versus the unconditional result, and it lacks intuitive control over the magnitude of change. Classifier guidance manipulates images by imposing the gradients of a classifier on the latent variables during the reverse process to match a target class (Dhariwal & Nichol, 2021; Avrahami et al., 2022; Liu et al., 2021). It requires training an extra classifier on the latent variables, i.e., noisy images. Furthermore, computing gradients through the classifier during sampling is costly. Finetuning the whole model can steer the resulting images toward the target attribute without the above problems (Kim & Ye, 2021). Still, it requires multiple models to reflect multiple descriptions.
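As a minimal illustration of the classifier-guidance mechanism described above, the sketch below shifts a DDPM-style posterior mean along the gradient of a classifier's log-probability, as in Dhariwal & Nichol (2021). Every component here is a toy stand-in rather than the actual models: `eps_model` is a hypothetical noise predictor, the schedule constant `alpha` is simplified, and the "classifier" is a quadratic whose gradient is analytic.

```python
import numpy as np

def eps_model(x_t, t):
    # Hypothetical noise predictor: a real model would be a trained U-Net.
    return 0.1 * x_t

def classifier_log_prob_grad(x_t, target):
    # Gradient of a toy log p(y | x_t) = -0.5 * ||x_t - target||^2,
    # standing in for gradients of a classifier trained on noisy images.
    return -(x_t - target)

def guided_reverse_step(x_t, t, target, alpha=0.99, guidance_scale=1.0):
    """One reverse step whose predicted mean is shifted along the
    classifier gradient (schedule constants simplified for the sketch)."""
    eps = eps_model(x_t, t)
    mean = (x_t - (1 - alpha) / np.sqrt(1 - alpha) * eps) / np.sqrt(alpha)
    # Classifier guidance: nudge the mean toward higher classifier log-prob.
    sigma2 = 1 - alpha
    mean = mean + guidance_scale * sigma2 * classifier_log_prob_grad(x_t, target)
    return mean  # deterministic for illustration; DDPM also adds sigma * noise

x_t = np.ones(4)
print(guided_reverse_step(x_t, t=10, target=np.zeros(4)))
```

With `guidance_scale=0.0` the step reduces to the unconditional update; increasing the scale pulls each step further toward latents the classifier prefers, which is the costly per-step gradient computation the text refers to.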
On the other hand, generative adversarial networks (Goodfellow et al., 2020) inherently provide straightforward image editing in their latent space. Given a latent vector for an original image, we can find the direction in the latent space that maximizes the similarity of the resulting image with a target description in the CLIP embedding space (Patashnik et al., 2021). The latent direction found on one


