UNIFYING DIFFUSION MODELS' LATENT SPACE, WITH APPLICATIONS TO CYCLEDIFFUSION AND GUIDANCE

Abstract

Diffusion models have achieved unprecedented performance in generative modeling. The commonly adopted formulation of the latent code of diffusion models is a sequence of gradually denoised samples, as opposed to the simpler (e.g., Gaussian) latent space of GANs, VAEs, and normalizing flows. This paper provides an alternative, Gaussian formulation of the latent space of diffusion models, as well as a reconstructable DPM-Encoder that maps images into this latent space. While our formulation is based purely on the definition of diffusion models, we demonstrate several intriguing consequences. (1) Empirically, we observe that a common latent space emerges from two diffusion models trained independently on related domains. In light of this finding, we propose CycleDiffusion, which uses the DPM-Encoder for unpaired image-to-image translation. Furthermore, applying CycleDiffusion to text-to-image diffusion models, we show that large-scale text-to-image diffusion models can be used as zero-shot image-to-image editors. (2) One can guide pretrained diffusion models and GANs by controlling the latent codes in a unified, plug-and-play formulation based on energy-based models. Using the CLIP model and a face recognition model as guidance, we demonstrate that diffusion models have better coverage of low-density sub-populations and individuals than GANs.

1. INTRODUCTION

Diffusion models (Song & Ermon, 2019; Ho et al., 2020) have achieved unprecedented results in generative modeling and are instrumental to text-to-image models such as DALL·E 2 (Ramesh et al., 2022). Unlike GANs (Goodfellow et al., 2014), VAEs (Kingma & Welling, 2014), and normalizing flows (Dinh et al., 2015), which have a simple (e.g., Gaussian) latent space, the commonly adopted formulation of the "latent code" of diffusion models is a sequence of gradually denoised images. This formulation makes the prior distribution of the "latent code" data-dependent, deviating from the idea that generative models are mappings from simple noises to data (Goodfellow et al., 2014).

This paper provides a unified view of generative models of images by reformulating various diffusion models as deterministic maps from a Gaussian latent code z to an image x (Figure 1, Section 3.1). A question that follows is encoding: how to map an image x to a latent code z. Encoding has been studied for many generative models. For instance, VAEs and normalizing flows have encoders by design, GAN inversion (Xia et al., 2021) builds post hoc encoders for GANs, and deterministic diffusion probabilistic models (DPMs) (Song et al., 2021a;b) build encoders with forward ODEs. However, it is still unclear how to build an encoder for stochastic DPMs such as DDPM (Ho et al., 2020), non-deterministic DDIM (Song et al., 2021a), and latent diffusion models (Rombach et al., 2022). We propose DPM-Encoder (Section 3.2), a reconstructable encoder for stochastic DPMs.

We show that some intriguing consequences emerge from our definition of the latent space of diffusion models and our DPM-Encoder. First, it has been observed that, given two diffusion models, a fixed "random seed" produces similar images (Nichol et al., 2022). Under our formulation, we formalize "similar images" via an upper bound on image distances.
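To make the reformulation concrete, the following is a minimal NumPy sketch (not the authors' implementation; the function and parameter names `sample_as_deterministic_map`, `alphas`, `alphas_bar`, `sigmas`, and `eps_model` are placeholders) of the key idea: if every Gaussian draw made during DDPM ancestral sampling is collected up front into a single latent code z, the otherwise stochastic sampler becomes a deterministic map z → x.

```python
import numpy as np

def sample_as_deterministic_map(z, alphas, alphas_bar, sigmas, eps_model):
    """DDPM-style ancestral sampling with all Gaussian draws supplied
    up front: z[0] plays the role of x_T, and z[k] (k >= 1) is the
    noise injected at the k-th reverse step.  With z fixed, the map
    z -> x_0 is deterministic."""
    T = len(sigmas)
    x = z[0]                       # x_T ~ N(0, I)
    for t in range(T, 0, -1):      # t = T, ..., 1
        eps = eps_model(x, t)      # predicted noise eps_theta(x_t, t)
        # posterior mean of the DDPM reverse step
        mean = (x - (1.0 - alphas[t - 1])
                / np.sqrt(1.0 - alphas_bar[t - 1]) * eps) \
               / np.sqrt(alphas[t - 1])
        # use the pre-drawn noise z[T - t + 1] instead of a fresh randn
        x = mean + sigmas[t - 1] * z[T - t + 1]
    return x
```

Here z has shape (T + 1, D) for a D-dimensional image, and its entries are i.i.d. standard Gaussian under the prior, which is what gives the latent space its simple, data-independent distribution.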
Since the defined latent code contains all the randomness used during sampling, DPM-Encoder is similar in spirit to inferring the "random seed" from real images. Based on this intuition and the upper bound on image distances, we propose CycleDiffusion (Section 3.3), a method for unpaired image-to-image translation using two diffusion models trained independently on related domains.

1 Our code will be publicly available.
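The "inferring the random seed" intuition can be sketched as follows (a hedged NumPy sketch of a DPM-Encoder-style procedure, not the paper's implementation; `dpm_encode`, `alphas`, `alphas_bar`, `sigmas`, and `eps_model` are placeholder names): sample a forward diffusion trajectory from a real image, then solve each reverse update for the Gaussian noise that would have produced it.

```python
import numpy as np

def dpm_encode(x0, alphas, alphas_bar, sigmas, eps_model, rng):
    """Sample a forward trajectory x_1, ..., x_T ~ q(. | x_0), then solve
    each reverse update x_{t-1} = mean_t + sigma_t * z_t for z_t.  The
    result is a latent z = (x_T, z_T, ..., z_1) from which the stochastic
    sampler deterministically reconstructs the trajectory."""
    T = len(sigmas)
    xs = [x0]
    for t in range(1, T + 1):      # forward diffusion q(x_t | x_0)
        noise = rng.standard_normal(x0.shape)
        xs.append(np.sqrt(alphas_bar[t - 1]) * x0
                  + np.sqrt(1.0 - alphas_bar[t - 1]) * noise)
    z = [xs[T]]                    # first chunk of the latent code is x_T
    for t in range(T, 0, -1):
        eps = eps_model(xs[t], t)
        mean = (xs[t] - (1.0 - alphas[t - 1])
                / np.sqrt(1.0 - alphas_bar[t - 1]) * eps) \
               / np.sqrt(alphas[t - 1])
        # invert the reverse step: z_t = (x_{t-1} - mean_t) / sigma_t
        z.append((xs[t - 1] - mean) / sigmas[t - 1])
    return np.stack(z)             # shape (T + 1, *x0.shape)
```

By construction, replaying these recovered noises through the same stochastic sampler reproduces the trajectory, and hence x0, exactly; this is what makes the encoder reconstructable. Unpaired translation in the CycleDiffusion spirit would then amount to encoding with a model trained on the source domain and decoding with a model trained on the target domain.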

