UNIFYING DIFFUSION MODELS' LATENT SPACE, WITH APPLICATIONS TO CYCLEDIFFUSION AND GUIDANCE

Abstract

Diffusion models have achieved unprecedented performance in generative modeling. The commonly-adopted formulation of the latent code of diffusion models is a sequence of gradually denoised samples, as opposed to the simpler (e.g., Gaussian) latent space of GANs, VAEs, and normalizing flows. This paper provides an alternative, Gaussian formulation of the latent space of diffusion models, as well as a reconstructable DPM-Encoder that maps images into the latent space. While our formulation is purely based on the definition of diffusion models, we demonstrate several intriguing consequences. (1) Empirically, we observe that a common latent space emerges from two diffusion models trained independently on related domains. In light of this finding, we propose CycleDiffusion, which uses DPM-Encoder for unpaired image-to-image translation. Furthermore, applying CycleDiffusion to text-to-image diffusion models, we show that large-scale text-to-image diffusion models can be used as zero-shot image-to-image editors. (2) One can guide pretrained diffusion models and GANs by controlling the latent codes in a unified, plug-and-play formulation based on energy-based models. Using the CLIP model and a face recognition model as guidance, we demonstrate that diffusion models have better coverage of low-density sub-populations and individuals than GANs.

1. INTRODUCTION

Diffusion models (Song & Ermon, 2019; Ho et al., 2020) have achieved unprecedented results in generative modeling and are instrumental to text-to-image models such as DALL•E 2 (Ramesh et al., 2022). Unlike GANs (Goodfellow et al., 2014), VAEs (Kingma & Welling, 2014), and normalizing flows (Dinh et al., 2015), which have a simple (e.g., Gaussian) latent space, the commonly-adopted formulation of the "latent code" of diffusion models is a sequence of gradually denoised images. This formulation makes the prior distribution of the "latent code" data-dependent, deviating from the idea that generative models are mappings from simple noises to data (Goodfellow et al., 2014).

This paper provides a unified view of generative models of images by reformulating various diffusion models as deterministic maps from a Gaussian latent code z to an image x (Figure 1, Section 3.1). A question that follows is encoding: how to map an image x to a latent code z. Encoding has been studied for many generative models. For instance, VAEs and normalizing flows have encoders by design, GAN inversion (Xia et al., 2021) builds post hoc encoders for GANs, and deterministic diffusion probabilistic models (DPMs) (Song et al., 2021a;b) build encoders with forward ODEs. However, it is still unclear how to build an encoder for stochastic DPMs such as DDPM (Ho et al., 2020), non-deterministic DDIM (Song et al., 2021a), and latent diffusion models (Rombach et al., 2022). We propose DPM-Encoder (Section 3.2), a reconstructable encoder for stochastic DPMs.

We show that several intriguing consequences emerge from our definition of the latent space of diffusion models and our DPM-Encoder. First, it has been observed that, given two diffusion models, a fixed "random seed" produces similar images (Nichol et al., 2022). Under our formulation, we formalize "similar images" via an upper bound on image distances.
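To make this reformulation concrete, here is a minimal NumPy sketch of the idea: a toy DDPM sampler in which every Gaussian draw is read from a single latent code z, so that sampling becomes a deterministic map G from z to an image. The noise schedule, dimensions, and the placeholder noise predictor `eps_theta` are illustrative assumptions, not the trained models discussed in this paper; for simplicity, noise is injected at every reverse step.

```python
import numpy as np

T, dim = 4, 2
betas = np.linspace(1e-2, 2e-1, T)            # assumed noise schedule
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)

def eps_theta(x_t, t):
    """Hypothetical noise predictor; a real model would be a trained network."""
    return 0.1 * x_t

def G(z):
    """Deterministic DDPM sampler: all randomness lives in the latent code z,
    which packs x_T together with one Gaussian noise per reverse step."""
    z = np.asarray(z).reshape(T + 1, dim)
    x = z[0]                                  # x_T ~ N(0, I), read from z
    for i, t in enumerate(reversed(range(T))):
        mean = (x - betas[t] / np.sqrt(1 - alpha_bars[t]) * eps_theta(x, t)) / np.sqrt(alphas[t])
        x = mean + np.sqrt(betas[t]) * z[i + 1]
    return x

rng = np.random.default_rng(0)
z = rng.standard_normal((T + 1) * dim)        # one Gaussian latent code
x = G(z)
assert np.allclose(x, G(z))                   # same z always yields the same x
```

Since z is drawn from a standard Gaussian and G is deterministic, the diffusion model now has the same interface as a GAN, VAE decoder, or normalizing flow.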
Since the defined latent code contains all randomness during sampling, DPM-Encoder is similar in spirit to inferring the "random seed" from real images. Based on this intuition and the upper bound on image distances, we propose CycleDiffusion (Section 3.3), a method for unpaired image-to-image translation using DPM-Encoder.

With a simple latent prior, generative models can be guided in a plug-and-play manner by means of energy-based models (Nguyen et al., 2017; Nie et al., 2021; Wu et al., 2022). Thus, our unification allows unified, plug-and-play guidance for various diffusion models and GANs (Section 3.4), which avoids finetuning the guidance model on noisy images, as prior work requires for diffusion models (Dhariwal & Nichol, 2021; Liu et al., 2021). With the CLIP model and a face recognition model as guidance, we show that diffusion models have broader coverage of low-density sub-populations and individuals than GANs (Section 4.3).
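The intuition behind DPM-Encoder can be sketched under the same toy assumptions as above (a placeholder noise predictor `eps_theta`, noise injected at every reverse step): run the forward diffusion on an image x0, then read off the Gaussian noises that make the reverse sampler reproduce that trajectory exactly, so that decoding reconstructs x0.

```python
import numpy as np

T, dim = 4, 2
betas = np.linspace(1e-2, 2e-1, T)
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)

def eps_theta(x_t, t):
    return 0.1 * x_t                          # hypothetical noise predictor

def mu_theta(x, t):
    """Posterior mean used by the DDPM sampler at step t."""
    return (x - betas[t] / np.sqrt(1 - alpha_bars[t]) * eps_theta(x, t)) / np.sqrt(alphas[t])

def decode(z):
    """Deterministic sampler G(z); z = [x_T, eps_{T-1}, ..., eps_0]."""
    x = z[0]
    for i, t in enumerate(reversed(range(T))):
        x = mu_theta(x, t) + np.sqrt(betas[t]) * z[i + 1]
    return x

def encode(x0, rng):
    """DPM-Encoder sketch: sample a forward trajectory, read off the noises."""
    xs = [x0]
    for t in range(T):                        # forward diffusion q(x_{t+1} | x_t)
        xs.append(np.sqrt(alphas[t]) * xs[t]
                  + np.sqrt(betas[t]) * rng.standard_normal(dim))
    z = [xs[T]]
    for t in reversed(range(T)):              # noise that makes the sampler hit x_t
        z.append((xs[t] - mu_theta(xs[t + 1], t)) / np.sqrt(betas[t]))
    return z

rng = np.random.default_rng(0)
x0 = np.array([0.5, -1.0])
z = encode(x0, rng)
assert np.allclose(decode(z), x0)             # reconstructable: G(Enc(x0)) == x0
```

Because each recovered noise solves the reverse update for the sampled trajectory, decoding is exact by construction; the encoder is stochastic only through the forward trajectory it samples.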

2. RELATED WORK

Recent years have witnessed great progress in generative models, such as GANs (Goodfellow et al., 2014), diffusion models (Song & Ermon, 2019; Ho et al., 2020; Dhariwal & Nichol, 2021), VAEs (Kingma & Welling, 2014), normalizing flows (Dinh et al., 2015), and their hybrid extensions (Sinha et al., 2021; Vahdat et al., 2021; Zhang & Chen, 2021; Kim et al., 2022a). Previous works have shown that their training objectives are related, e.g., diffusion models can be viewed as VAEs (Ho et al., 2020; Kingma et al., 2021; Huang et al., 2021), and GANs and VAEs can be framed via KL divergences (Hu et al., 2018) or mutual information with consistency constraints (Zhao et al., 2018); a recent attempt (Zhang et al., 2022b) has been made to unify several generative models as GFlowNets (Bengio et al., 2021). In contrast, this paper unifies generative models as deterministic mappings from Gaussian noises to data (a.k.a. implicit models) once they are trained. Generative models with non-Gaussian randomness (Davidson et al., 2018; Nachmani et al., 2021) can be unified as deterministic mappings in similar ways.

One of the most fundamental challenges in generative modeling is to design an encoder that is both computationally efficient and invertible. VAEs and normalizing flows have encoders by design, and GAN inversion trains an encoder after GANs are pretrained (Xia et al., 2021). Song et al. (2021a;b) studied encoding for ODE-based deterministic diffusion probabilistic models (DPMs). However, it remains unclear how to encode for general stochastic DPMs, and DPM-Encoder fills this gap. CycleDiffusion can also be seen as an extension of the DDIB approach of Su et al. (2022) to stochastic DPMs.
Previous works have formulated plug-and-play guidance of generative models as latent-space energy-based models (EBMs) (Nguyen et al., 2017; Nie et al., 2021; Wu et al., 2022), and our unification makes this guidance applicable to various diffusion models, which are effective for modeling images, audio (Kong et al., 2021), videos (Ho et al., 2022; Hoppe et al., 2022), molecules (Xu et al., 2022), 3D objects (Luo & Hu, 2021), and text (Li et al., 2022). This plug-and-play guidance can provide principled, fine-grained comparisons of models' coverage of sub-populations and individuals on the same dataset.

A concurrent work observed that fixing both (1) the random seed and (2) the cross-attention maps in Transformer-based text-to-image diffusion models results in images with minimal changes (Hertz et al., 2022). That work names the idea of fixing the cross-attention map Cross Attention Control (CAC), which can be used to edit model-generated images when the random seed is known.
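As an illustration of latent-space EBM guidance, the following sketch replaces a pretrained generator with a toy linear map G(z) = Wz and replaces the CLIP or face-recognition guidance with a simple squared distance to a target; both are stand-ins for the actual models, and the step size and weight are arbitrary. Sampling follows Langevin dynamics on the latent-space energy E(z) = ||z||^2/2 + lam * ||G(z) - target||^2, whose first term is the Gaussian prior over z.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((2, 4))        # toy stand-in for a pretrained generator
target = np.array([1.0, -1.0])         # stand-in for a guidance target
lam, step, n_steps = 5.0, 5e-3, 4000

def grad_E(z):
    """Gradient of E(z) = ||z||^2/2 + lam * ||W z - target||^2 (analytic
    here because the toy generator is linear; autodiff in general)."""
    return z + lam * 2.0 * W.T @ (W @ z - target)

z = rng.standard_normal(4)             # initialize from the Gaussian prior
for _ in range(n_steps):               # Langevin dynamics on the latent code
    z = z - step * grad_E(z) + np.sqrt(2 * step) * rng.standard_normal(4)

# After sampling, the generated output should sit near the guidance target
# while z stays plausible under the prior.
assert np.linalg.norm(W @ z - target) < 1.0
```

Because the energy only touches z and G's output, the same recipe applies unchanged to any generator with a Gaussian latent code, which is what makes the guidance plug-and-play across diffusion models and GANs.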



Our code will be publicly available.



Figure 1: Once trained, various types of diffusion models can be reformulated as deterministic maps from latent code z to image x, like GANs, VAEs, and normalizing flows.

