UNIFYING DIFFUSION MODELS' LATENT SPACE, WITH APPLICATIONS TO CYCLEDIFFUSION AND GUIDANCE

Abstract

Diffusion models have achieved unprecedented performance in generative modeling. The commonly-adopted formulation of the latent code of diffusion models is a sequence of gradually denoised samples, as opposed to the simpler (e.g., Gaussian) latent space of GANs, VAEs, and normalizing flows. This paper provides an alternative, Gaussian formulation of the latent space of diffusion models, as well as a reconstructable DPM-Encoder that maps images into the latent space. While our formulation is purely based on the definition of diffusion models, we demonstrate several intriguing consequences. (1) Empirically, we observe that a common latent space emerges from two diffusion models trained independently on related domains. In light of this finding, we propose CycleDiffusion, which uses DPM-Encoder for unpaired image-to-image translation. Furthermore, applying CycleDiffusion to text-to-image diffusion models, we show that large-scale text-to-image diffusion models can be used as zero-shot image-to-image editors. (2) One can guide pretrained diffusion models and GANs by controlling the latent codes in a unified, plug-and-play formulation based on energy-based models. Using the CLIP model and a face recognition model as guidance, we demonstrate that diffusion models have better coverage of low-density sub-populations and individuals than GANs.¹

1. INTRODUCTION

Diffusion models (Song & Ermon, 2019; Ho et al., 2020) have achieved unprecedented results in generative modeling and are instrumental to text-to-image models such as DALL•E 2 (Ramesh et al., 2022). Unlike GANs (Goodfellow et al., 2014), VAEs (Kingma & Welling, 2014), and normalizing flows (Dinh et al., 2015), which have a simple (e.g., Gaussian) latent space, the commonly-adopted formulation of the "latent code" of diffusion models is a sequence of gradually denoised images. This formulation makes the prior distribution of the "latent code" data-dependent, deviating from the idea that generative models are mappings from simple noises to data (Goodfellow et al., 2014).

This paper provides a unified view of generative models of images by reformulating various diffusion models as deterministic maps from a Gaussian latent code z to an image x (Figure 1, Section 3.1). A question that follows is encoding: how to map an image x to a latent code z. Encoding has been studied for many generative models. For instance, VAEs and normalizing flows have encoders by design, GAN inversion (Xia et al., 2021) builds post hoc encoders for GANs, and deterministic diffusion probabilistic models (DPMs) (Song et al., 2021a;b) build encoders with forward ODEs. However, it is still unclear how to build an encoder for stochastic DPMs such as DDPM (Ho et al., 2020), non-deterministic DDIM (Song et al., 2021a), and latent diffusion models (Rombach et al., 2022). We propose DPM-Encoder (Section 3.2), a reconstructable encoder for stochastic DPMs.

We show that some intriguing consequences emerge from our definition of the latent space of diffusion models and our DPM-Encoder. First, observations have been made that, given two diffusion models, a fixed "random seed" produces similar images (Nichol et al., 2022). Under our formulation, we formalize "similar images" via an upper bound of image distances.
Since the defined latent code contains all randomness during sampling, DPM-Encoder is similar in spirit to inferring the "random seed" from real images. Based on this intuition and the upper bound of image distances, we propose CycleDiffusion (Section 3.3), a method for unpaired image-to-image translation using our DPM-Encoder. Like the GAN-based UNIT method (Liu et al., 2017), CycleDiffusion encodes and decodes images using the common latent space. Our experiments show that CycleDiffusion outperforms previous methods based on GANs or diffusion models (Section 4.1). Furthermore, by applying large-scale text-to-image diffusion models (e.g., Stable Diffusion; Rombach et al., 2022) to CycleDiffusion, we obtain zero-shot image-to-image editors (Section 4.2).

With a simple latent prior, generative models can be guided in a plug-and-play manner by means of energy-based models (Nguyen et al., 2017; Nie et al., 2021; Wu et al., 2022). Thus, our unification allows unified, plug-and-play guidance for various diffusion models and GANs (Section 3.4), which avoids finetuning the guidance model on noisy images for diffusion models (Dhariwal & Nichol, 2021; Liu et al., 2021). With the CLIP model and a face recognition model as guidance, we show that diffusion models have broader coverage of low-density sub-populations and individuals (Section 4.3).

Figure 1: Once trained, various types of diffusion models can be reformulated as deterministic maps from latent code z to image x, like GANs, VAEs, and normalizing flows.

2. RELATED WORK

Recent years have witnessed great progress in generative models, such as GANs (Goodfellow et al., 2014), diffusion models (Song & Ermon, 2019; Ho et al., 2020; Dhariwal & Nichol, 2021), VAEs (Kingma & Welling, 2014), normalizing flows (Dinh et al., 2015), and their hybrid extensions (Sinha et al., 2021; Vahdat et al., 2021; Zhang & Chen, 2021; Kim et al., 2022a). Previous works have shown that their training objectives are related, e.g., diffusion models as VAEs (Ho et al., 2020; Kingma et al., 2021; Huang et al., 2021); GANs and VAEs as KL divergences (Hu et al., 2018) or mutual information with consistency constraints (Zhao et al., 2018); a recent attempt (Zhang et al., 2022b) has been made to unify several generative models as GFlowNets (Bengio et al., 2021). In contrast, this paper unifies generative models as deterministic mappings from Gaussian noises to data (aka implicit models) once they are trained. Generative models with non-Gaussian randomness (Davidson et al., 2018; Nachmani et al., 2021) can be unified as deterministic mappings in similar ways.

One of the most fundamental challenges in generative modeling is to design an encoder that is both computationally efficient and invertible. GAN inversion trains an encoder after GANs are pre-trained (Xia et al., 2021). VAEs and normalizing flows have their encoders by design. Song et al. (2021a;b) studied encoding for ODE-based deterministic diffusion probabilistic models (DPMs). However, it remains unclear how to encode for general stochastic DPMs, and DPM-Encoder fills this gap. Also, CycleDiffusion can be seen as an extension of Su et al. (2022)'s DDIB approach to stochastic DPMs.
Previous works have formulated plug-and-play guidance of generative models as latent-space energy-based models (EBMs) (Nguyen et al., 2017; Nie et al., 2021; Wu et al., 2022), and our unification makes it applicable to various diffusion models, which are effective for modeling images, audio (Kong et al., 2021), videos (Ho et al., 2022; Hoppe et al., 2022), molecules (Xu et al., 2022), 3D objects (Luo & Hu, 2021), and text (Li et al., 2022). This plug-and-play guidance can provide principled, fine-grained model comparisons of coverage of sub-populations and individuals on the same dataset. A concurrent work observed that fixing both (1) the random seed and (2) the cross-attention map in Transformer-based text-to-image diffusion models results in images with minimal changes (Hertz et al., 2022). The idea of fixing the cross-attention map is named Cross Attention Control (CAC) in that work, which can be used to edit model-generated images when the random seed is known. For real images with stochastic DPMs, they generate masks based on the attention map because the random seed is unknown. In Section 4.2, we show that CycleDiffusion and CAC can be combined to improve the structural preservation of image editing.

Table 1: Details of redefining various diffusion models' latent space (Section 3.1).

Model | Latent code z | Deterministic map x = G(z)
Stochastic DPMs | z := x_T ⊕ ε_T ⊕ ⋯ ⊕ ε_1 | x_{t−1} = μ_θ(x_t, t) + σ_t ε_t (t ≤ T); x := x_0
Deterministic DPMs | z := x_T (T = T_g if with gradient) | x_{t−1} = μ_θ(x_t, t) (t ≤ T); x := x_0
LDM | z of G_latent | z′ = G_latent(z); x = D(z′)
DiffAE | z := z_T ⊕ x_T | z′ = DDIM_Z(z_T); x := x_0 = DDIM_X(x_T, z′)
DDGAN | z := x_T ⊕ z_T ⊕ ε_T ⊕ ⋯ ⊕ z_2 ⊕ ε_2 ⊕ z_1 | x_{t−1} = μ_θ(x_t, z_t, t) + σ_t ε_t (1 < t ≤ T); x := x_0 = μ_θ(x_1, z_1, 1)

3.1. GAUSSIAN LATENT SPACE FOR DIFFUSION MODELS

Generative models such as GANs, VAEs, and normalizing flows can be seen as a family of implicit models, meaning that they are deterministic maps G : ℝ^d → 𝒳 from latent codes z to images x. At inference, sampling from the image prior x ∼ p_x(x) is implicitly defined as z ∼ p_z(z), x = G(z). The latent prior p_z(z) is commonly chosen to be the isotropic Gaussian distribution. In this section, we show how to unify diffusion models into this family. An overview is shown in Figure 1 and Table 1.

Stochastic DPMs: Stochastic DPMs (Ho et al., 2020; Song & Ermon, 2019; Song et al., 2021b;a; Watson et al., 2022) generate images with a Markov chain. Given the mean estimator μ_θ (see Appendix A) and x_T ∼ N(0, I), the image x := x_0 is generated through x_{t−1} ∼ N(μ_θ(x_t, t), σ_t² I). Using the reparameterization trick, we define the latent code z and the mapping G recursively as

z := x_T ⊕ ε_T ⊕ ⋯ ⊕ ε_1 ∼ N(0, I),    x_{t−1} = μ_θ(x_t, t) + σ_t ε_t,    t = T, …, 1,    (1)

where ⊕ is concatenation. Here, z has dimension d = d_I × (T + 1), where d_I is the image dimension.

Deterministic DPMs: Deterministic DPMs (Song et al., 2021a;b; Salimans & Ho, 2022; Liu et al., 2022; Lu et al., 2022; Karras et al., 2022; Zhang & Chen, 2022) generate images with the ODE formulation. Given the mean estimator μ_θ, deterministic DPMs generate x := x_0 via

z := x_T ∼ N(0, I),    x_{t−1} = μ_θ(x_t, t),    t = T, …, 1.    (2)

Since backpropagation through Eq. (2) is costly, we use fewer discretization steps T_g when computing gradients. Given the mean estimator μ_θ with number of steps T_g, the image x := x_0 is generated as

(with gradients)    z := x_{T_g} ∼ N(0, I),    x_{t−1} = μ_θ(x_t, t),    t = T_g, …, 1.    (3)

Latent diffusion models (LDMs): An LDM (Rombach et al., 2022) first uses a diffusion model G_latent to compute a "latent code"² z′ = G_latent(z), which is then decoded as x = D(z′). Note that G_latent is an abstraction of the diffusion models that are already unified above.
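To make the reformulation concrete, here is a minimal NumPy sketch of Eq. (1): once z (which concatenates x_T with all per-step noises ε_t) is fixed, the stochastic sampler becomes a deterministic map G. The mean estimator mu_theta below is a made-up linear stand-in for a trained network, and the noise schedule sigma is arbitrary; this is an illustration of the definition, not the authors' implementation.

```python
import numpy as np

# Hypothetical stand-in for the trained mean estimator mu_theta(x_t, t);
# a real DPM would use a neural network here.
def mu_theta(x_t, t):
    return 0.9 * x_t

def G(z, T, d_I, sigma):
    """Deterministic map from latent code z (dim d_I * (T + 1)) to image x,
    following Eq. (1): z concatenates x_T with the noises eps_T, ..., eps_1."""
    x = z[:d_I]                    # x_T ~ N(0, I)
    eps = z[d_I:].reshape(T, d_I)  # eps_T, ..., eps_1
    for i, t in enumerate(range(T, 0, -1)):
        x = mu_theta(x, t) + sigma[t] * eps[i]  # x_{t-1}
    return x                       # x := x_0

T, d_I = 10, 4
sigma = {t: 0.1 for t in range(1, T + 1)}  # toy noise schedule
rng = np.random.default_rng(0)
z = rng.standard_normal(d_I * (T + 1))     # z ~ N(0, I)
x0_a, x0_b = G(z, T, d_I, sigma), G(z, T, d_I, sigma)
# Fixing z fixes the image: all sampling randomness lives in z.
```

Calling G twice with the same z yields the same image, which is exactly what makes the latent-code view well defined.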
Diffusion autoencoder (DiffAE): DiffAE (Preechakul et al., 2022) first uses a deterministic DDIM to generate a "latent code"² z′, which is used as the condition for an image-space deterministic DDIM:

z := z_T ⊕ x_T ∼ N(0, I),    z′ = DDIM_Z(z_T),    x := x_0 = DDIM_X(x_T, z′).

DDGAN: DDGAN (Xiao et al., 2022) models each reverse time step t as a GAN conditioned on the output of the previous step. We define the latent code z and generation process G of DDGAN as

z := x_T ⊕ z_T ⊕ ε_T ⊕ ⋯ ⊕ z_2 ⊕ ε_2 ⊕ z_1 ∼ N(0, I),    x_{t−1} = μ_θ(x_t, z_t, t) + σ_t ε_t,    t = T, …, 2,    x := x_0 = μ_θ(x_1, z_1, 1).

Algorithm 1: CycleDiffusion for zero-shot image-to-image translation
Input: source image x := x_0; source text t; target text t̂; encoding step T_es ≤ T
1. Sample noisy image x̂_{T_es} = x_{T_es} ∼ q(x_{T_es} | x_0)
for t = T_es, …, 1 do
    2. x_{t−1} ∼ q(x_{t−1} | x_t, x_0)
    3. ε_t = (x_{t−1} − μ_θ(x_t, t | t)) / σ_t
    4. x̂_{t−1} = μ_θ(x̂_t, t | t̂) + σ_t ε_t
Output: x̂ := x̂_0

3.2. DPM-ENCODER: A RECONSTRUCTABLE ENCODER FOR DIFFUSION MODELS

In this section, we investigate the encoding problem, i.e., z ∼ Enc(z|x, G). The encoding problem has been studied for many generative models, and our contribution is DPM-Encoder, an encoder for stochastic DPMs. DPM-Encoder is defined as follows. For each image x := x_0, stochastic DPMs define a posterior distribution q(x_{1:T} | x_0) (Ho et al., 2020; Song et al., 2021a). Based on q(x_{1:T} | x_0) and Eq. (1), we can directly derive z ∼ DPMEnc(z|x, G) as (see details in Appendices A and B)

x_1, …, x_{T−1}, x_T ∼ q(x_{1:T} | x_0),    ε_t = (x_{t−1} − μ_θ(x_t, t)) / σ_t,    t = T, …, 1,    z := x_T ⊕ ε_T ⊕ ⋯ ⊕ ε_2 ⊕ ε_1.

A property of DPM-Encoder is perfect reconstruction, meaning that we have x = G(z) for every z ∼ DPMEnc(z|x, G). A proof by induction is provided in Appendix B.
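The perfect-reconstruction property can be checked numerically. The NumPy sketch below uses a hypothetical linear mu_theta as a stand-in for the trained network; it encodes a trajectory x_{0:T} into z by inverting the reparameterization of Eq. (1) and then decodes with G. Note the property holds for any trajectory, not only one sampled from the true posterior q(x_{1:T} | x_0), which mirrors the induction argument.

```python
import numpy as np

T, d_I = 10, 4
sigma = {t: 0.1 for t in range(1, T + 1)}  # toy noise schedule

def mu_theta(x_t, t):
    # hypothetical linear stand-in for the trained mean estimator
    return 0.9 * x_t

def dpm_encode(traj):
    """DPM-Encoder: z := x_T + eps_T + ... + eps_1 (concatenated),
    where traj[t] is x_t."""
    parts = [traj[T]]
    for t in range(T, 0, -1):
        eps_t = (traj[t - 1] - mu_theta(traj[t], t)) / sigma[t]
        parts.append(eps_t)
    return np.concatenate(parts)

def G(z):
    """The deterministic map of Eq. (1)."""
    x = z[:d_I]
    eps = z[d_I:].reshape(T, d_I)
    for i, t in enumerate(range(T, 0, -1)):
        x = mu_theta(x, t) + sigma[t] * eps[i]
    return x

# The property holds for ANY trajectory x_{0:T}; a real DPM would sample it
# from the posterior q(x_{1:T} | x_0).
rng = np.random.default_rng(0)
x0 = rng.standard_normal(d_I)
traj = {0: x0}
for t in range(1, T + 1):
    traj[t] = rng.standard_normal(d_I)

z = dpm_encode(traj)
x_rec = G(z)  # perfect reconstruction: x_rec equals x0
```

Decoding reuses exactly the noises that encoding inferred, so each step recovers x_{t−1} identically, which is the induction step of the proof.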

3.3. CYCLEDIFFUSION: IMAGE-TO-IMAGE TRANSLATION WITH DPM-ENCODER

Given two stochastic DPMs G_1 and G_2 that model two distributions 𝒟_1 and 𝒟_2, several researchers and practitioners have found that sampling with the same "random seed" leads to similar images (Nichol et al., 2022). To formalize "similar images", we provide an upper bound of image distances based on assumptions about the trained DPMs, shown at the end of this subsection. Based on this finding, we propose a simple unpaired image-to-image translation method, CycleDiffusion. Given a source image x ∈ 𝒟_1, we use DPM-Encoder to encode it as z and then decode it as x̂ = G_2(z):

z ∼ DPMEnc(z | x, G_1),    x̂ = G_2(z).

We can also apply CycleDiffusion to text-to-image diffusion models by defining 𝒟_1 and 𝒟_2 as image distributions conditioned on two texts. Let G_t be a text-to-image diffusion model conditioned on text t. Given a source image x, the user writes two texts: a source text t describing the source image x and a target text t̂ describing the target image x̂ to be generated. We can then perform zero-shot image-to-image editing (zero-shot means that the model has never been trained on image editing) via

z ∼ DPMEnc(z | x, G_t),    x̂ = G_t̂(z).

Inspired by the realism-faithfulness tradeoff in SDEdit (Meng et al., 2022), we can truncate z towards a specified encoding step T_es ≤ T. The algorithm of CycleDiffusion is shown in Algorithm 1.

Why does a fixed z lead to similar images? We illustrate with text-to-image diffusion models. Suppose the text-to-image model has the following two properties:
1. Conditioned on the same text, similar noisy images lead to similar enough mean predictions. Formally, μ_θ(x_t, t | t) is K_t-Lipschitz, i.e., ‖μ_θ(x_t, t | t) − μ_θ(x̂_t, t | t)‖ ≤ K_t ‖x_t − x̂_t‖.
2. Given the same image, the two texts lead to similar predictions. Formally, ‖μ_θ(x_t, t | t) − μ_θ(x_t, t | t̂)‖ ≤ S_t. Intuitively, a smaller difference between t and t̂ gives us a smaller S_t.
Let B_t be the upper bound of ‖x_t − x̂_t‖_2 at time step t when the same latent code z is used for sampling (i.e., x_0 = G_t(z) and x̂_0 = G_t̂(z)). We have B_T = 0 because ‖x_T − x̂_T‖_2 = 0, and B_0 is the upper bound for the distance between the generated images, ‖x − x̂‖_2. The upper bound B_t can be propagated through time, from T to 0. Specifically, by combining the above two properties, we have B_{t−1} ≤ (K_t + 1) B_t + S_t.
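The control flow of Algorithm 1 can be sketched as follows. Everything model-specific is stubbed out: mu_theta is a hypothetical text-conditioned mean estimator, and the two posterior sampling steps q(·) are replaced by simple made-up stand-ins, since the point is only the structure of steps 1–4 (infer ε_t under the source text, reuse it under the target text).

```python
import numpy as np

rng = np.random.default_rng(0)
T, d_I = 10, 4
sigma = {t: 0.1 for t in range(1, T + 1)}  # toy noise schedule

def mu_theta(x_t, t, text):
    # hypothetical text-conditioned mean estimator; `text` switches the model
    scale = 0.9 if text == "source" else 0.88
    return scale * x_t

def cycle_diffusion(x0, T_es):
    """Algorithm 1 sketch: encode under the source text, decode under the
    target text, reusing the inferred noises eps_t."""
    # 1. stand-in for sampling x_{T_es} ~ q(x_{T_es} | x_0)
    x = x0 + rng.standard_normal(d_I)
    x_hat = x
    for t in range(T_es, 0, -1):
        # 2. stand-in for x_{t-1} ~ q(x_{t-1} | x_t, x_0)
        x_prev = 0.5 * (x + x0) + 0.01 * rng.standard_normal(d_I)
        # 3. infer the noise under the source text
        eps_t = (x_prev - mu_theta(x, t, "source")) / sigma[t]
        # 4. reuse the noise under the target text
        x_hat = mu_theta(x_hat, t, "target") + sigma[t] * eps_t
        x = x_prev
    return x_hat  # x_hat := x_hat_0

x0 = rng.standard_normal(d_I)
x_edited = cycle_diffusion(x0, T_es=8)
```

Because the two decoders share all the noises in z, their trajectories stay close whenever the two conditioned mean estimators stay close, which is exactly the intuition the bound above formalizes.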

3.4. UNIFIED PLUG-AND-PLAY GUIDANCE FOR GENERATIVE MODELS

Prior works showed that guidance for generative models can be achieved in the latent space (Nguyen et al., 2017; Nie et al., 2021; Wu et al., 2022). Specifically, given a condition C, one can define the guided image distribution as an energy-based model (EBM): p(x|C) ∝ p_x(x) e^{−λ_C E(x|C)}. Sampling x ∼ p(x|C) is equivalent to z ∼ p_z(z|C), x = G(z), where p_z(z|C) ∝ p_z(z) e^{−λ_C E(G(z)|C)}. Examples of the energy function E(x|C) are provided in Section 4.3. To sample z ∼ p_z(z|C), one can use any model-agnostic sampler. For example, Langevin dynamics (Welling & Teh, 2011) starts from z^⟨0⟩ ∼ N(0, I) and samples z := z^⟨n⟩ iteratively through

z^⟨k+1⟩ = z^⟨k⟩ + (λ/2) ∇_z ( log p_z(z^⟨k⟩) − λ_C E(G(z^⟨k⟩) | C) ) + √λ ω^⟨k⟩,    ω^⟨k⟩ ∼ N(0, I).    (10)
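Eq. (10) can be sketched on a toy latent-space EBM. Here G is a made-up linear "generator" and E a squared-distance energy to a target; both gradients are written in closed form, whereas a real implementation would backpropagate through the generator network. All names and constants are illustrative assumptions, not the paper's setup.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8
lam, lam_C, n = 0.05, 5.0, 200  # step size, guidance strength, #steps

def G(z):
    # hypothetical linear "generator": image = first 4 latent coords, doubled
    return 2.0 * z[:4]

def energy(x, target):
    """Toy energy E(x|C): squared distance to a target image."""
    return 0.5 * float(np.sum((x - target) ** 2))

def grad_log_pz(z):
    return -z  # gradient of log N(z; 0, I)

def grad_energy_z(z, target):
    # chain rule through the toy G: dE/dz = (dG/dz)^T (G(z) - target)
    g = np.zeros_like(z)
    g[:4] = 2.0 * (G(z) - target)
    return g

target = np.ones(4)
z = rng.standard_normal(d)  # z^<0> ~ N(0, I)
for _ in range(n):          # Langevin dynamics, Eq. (10)
    drift = grad_log_pz(z) - lam_C * grad_energy_z(z, target)
    z = z + (lam / 2.0) * drift + np.sqrt(lam) * rng.standard_normal(d)
# After sampling, G(z) should be close to the target (low energy).
```

The sampler is fully model-agnostic: only ∇_z of the log-prior and of the energy (through G) are needed, which is why the same loop applies to GANs and, with our latent-code definition, to diffusion models.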

4. EXPERIMENTS

This section provides experimental validation of the proposed work. Section 4.1 shows how CycleDiffusion achieves competitive results on unpaired image-to-image translation benchmarks. Section 4.2 provides a protocol for what we call zero-shot image-to-image translation; CycleDiffusion outperforms several image-to-image translation baselines that we re-purposed for this new task. Section 4.3 shows how diffusion models and GANs can be guided in a unified, plug-and-play formulation.

4.1. CYCLEDIFFUSION FOR UNPAIRED IMAGE-TO-IMAGE TRANSLATION

Given two unaligned image domains, unpaired image-to-image translation aims at mapping images in one domain to the other. We follow setups from previous works whenever possible, as detailed below. Following previous work (Park et al., 2020; Zhao et al., 2022), we conducted experiments on the test set of AFHQ (Choi et al., 2020) with resolution 256 × 256 for Cat → Dog and Wild → Dog. For each source image, each method should generate a target image with minimal changes. Since CycleDiffusion sometimes generates noisy outputs, we used T_sdedit steps of SDEdit for denoising. When T = 1000, we set T_sdedit = 100 for Cat → Dog and T_sdedit = 125 for Wild → Dog. Metrics: To evaluate realism, we reported Fréchet Inception Distance (FID; Heusel et al., 2017) and Kernel Inception Distance (KID; Bińkowski et al., 2018) between the generated and target images. To evaluate faithfulness, we reported Peak Signal-to-Noise Ratio (PSNR) and Structural Similarity Index Measure (SSIM; Wang et al., 2004) between each generated image and its source image. Baselines: We compared CycleDiffusion with previous state-of-the-art unpaired image-to-image translation methods: CUT (Park et al., 2020), ILVR (Choi et al., 2021), SDEdit (Meng et al., 2022), and EGSDE (Zhao et al., 2022). CUT is GAN-based, and the others use diffusion models. Pre-trained diffusion models: ILVR, SDEdit, and EGSDE only need the diffusion model trained on the target domain, and we followed them in using the pre-trained model from Choi et al. (2021) for Dog. CycleDiffusion needs diffusion models on both domains, so we trained them on Cat and Wild. Table 2 shows the results. CycleDiffusion has the best realism (i.e., FID and KID). The two faithfulness metrics (PSNR and SSIM) disagree; note that SSIM is much better correlated with human perception than PSNR (Wang et al., 2004). Among all diffusion model-based methods, CycleDiffusion achieves the highest SSIM.
Figure 2 displays some image samples from CycleDiffusion, showing that our method can change the domain while preserving local details such as the background, lighting, pose, and overall color of the animal.

4.2. TEXT-TO-IMAGE DIFFUSION MODELS CAN BE ZERO-SHOT IMAGE-TO-IMAGE EDITORS

This section provides experiments for zero-shot image editing. We curated a set of 150 tuples (x, t, t̂) for this task, where x is the source image, t is the source text (e.g., "an aerial view of autumn scene." in Figure 3, second row on the right), and t̂ is the target text (e.g., "an aerial view of winter scene."). The generated image is denoted as x̂. We also demonstrate that CycleDiffusion can be combined with Cross Attention Control (Hertz et al., 2022) to further preserve the image structure. Metrics: To evaluate faithfulness to the source image, we reported PSNR and SSIM. To evaluate authenticity to the target text, we reported the CLIP score S_CLIP(x̂ | t̂) = cos⟨CLIP_img(x̂), CLIP_text(t̂)⟩, where the CLIP embeddings are normalized. We note a trade-off between PSNR/SSIM and S_CLIP: by copying the source image we get high PSNR/SSIM but low S_CLIP, and by ignoring the source image (e.g., by directly generating images conditioned on the target text) we get high S_CLIP but low PSNR/SSIM. To address this trade-off, we also reported the directional CLIP score (Patashnik et al., 2021) (both the CLIP embeddings and the embedding differences are normalized):

S_D-CLIP(x̂ | x, t, t̂) = cos⟨CLIP_img(x̂) − CLIP_img(x), CLIP_text(t̂) − CLIP_text(t)⟩.

Baselines: The baselines include SDEdit (Meng et al., 2022) and DDIB (Su et al., 2022). We used the same hyperparameters for the baselines and CycleDiffusion whenever possible (e.g., the number of diffusion steps, the strength of classifier-free guidance; see Appendix C).
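Both CLIP-based metrics above reduce to cosine similarities between embedding vectors. The sketch below computes them from precomputed embeddings; the 512-dimensional vectors are random stand-ins, since a real evaluation would obtain them from a CLIP image/text encoder (e.g., via the open_clip package).

```python
import numpy as np

def cos(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def s_clip(img_emb_out, txt_emb_tgt):
    """CLIP score S_CLIP(x_hat | t_hat) on embedding vectors."""
    return cos(img_emb_out, txt_emb_tgt)

def s_dclip(img_emb_out, img_emb_src, txt_emb_tgt, txt_emb_src):
    """Directional CLIP score: cosine between the image-embedding change
    and the text-embedding change."""
    return cos(img_emb_out - img_emb_src, txt_emb_tgt - txt_emb_src)

# Made-up 512-d embeddings standing in for CLIP encoder outputs.
rng = np.random.default_rng(0)
src_img = rng.standard_normal(512)
src_txt = rng.standard_normal(512)
delta = rng.standard_normal(512)
out_img = src_img + 0.5 * delta  # the edit moves the image embedding ...
tgt_txt = src_txt + 0.5 * delta  # ... in the same direction as the text change
score = s_dclip(out_img, src_img, tgt_txt, src_txt)
```

In this constructed case the image and text embeddings move in the same direction, so the directional score is maximal; copying the source image instead would give a zero difference vector and an undefined (in practice, penalized) directional score.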
Pre-trained text-to-image diffusion models: To investigate how the zero-shot performance changes with data size, data quality, and training details, we used the following models: (1) LDM-400M, a 1.45B-parameter model trained on LAION-400M (Schuhmann et al., 2021); (2) SD-v1-1, a 0.98B-parameter Stable Diffusion model trained on LAION-5B (Schuhmann et al., 2022); (3) SD-v1-4, finetuned from SD-v1-1 for improved aesthetics and classifier-free guidance. Results: Table 3 shows the results for zero-shot image-to-image translation. CycleDiffusion excels at being faithful to the source image (i.e., PSNR and SSIM); by contrast, SDEdit and DDIB have comparable authenticity to the target text (i.e., S_CLIP), but their outputs are much less faithful. For all methods, we find that the pre-trained weights SD-v1-1 and SD-v1-4 have better faithfulness than LDM-400M. Figure 3 provides samples from CycleDiffusion, demonstrating that CycleDiffusion achieves meaningful edits that span (1) replacing objects, (2) adding objects, (3) changing styles, and (4) modifying attributes. See Figure 7 (Appendix E) for qualitative comparisons with the baselines. CycleDiffusion + Cross Attention Control: Besides fixing the random seed, Hertz et al. (2022) show that fixing the cross-attention map (i.e., Cross Attention Control, or CAC) further improves the similarity between synthesized images. CAC is applicable to CycleDiffusion: in Algorithm 1, we can apply the attention map of μ_θ(x_t, t | t) to μ_θ(x̂_t, t | t̂). However, we cannot apply it to all samples because CAC puts requirements on the difference between t and t̂. Figure 4 shows that CAC helps CycleDiffusion when the intended structural change is small. For instance, when the intended change is color but not shape (left), CAC helps CycleDiffusion preserve the background; when the intended change is horse → elephant, CAC makes the generated elephant look more like a horse in shape.

4.3. UNIFIED PLUG-AND-PLAY GUIDANCE FOR DIFFUSION MODELS AND GANS

Previous methods for conditional sampling from (aka guiding) diffusion models require training the guidance model on noisy images (Dhariwal & Nichol, 2021; Liu et al., 2021), which deviates from the idea of plug-and-play guidance by leveraging the simple latent prior of generative models (Nguyen et al., 2017). In contrast, our definition of the Gaussian latent space of different diffusion models allows for unified plug-and-play guidance of diffusion models and GANs. It facilitates principled comparisons over sub-populations and individuals when models are trained on the same dataset. We used the text t to specify a sub-population. For instance, a photo of baby represents the baby sub-population in the domain of human faces. We instantiate the energy in Section 3.4 as

E_CLIP(x | t) = (1/L) Σ_{l=1}^{L} ( 1 − cos⟨CLIP_img(DiffAug_l(x)), CLIP_text(t)⟩ ),

where DiffAug_l stands for differentiable augmentation (Zhao et al., 2020), which mitigates the adversarial effect, and we sample from the energy-based distribution using Langevin dynamics in Eq. (10) with n = 200 and λ = 0.05. We enumerated the guidance strength (i.e., the coefficient λ_C in Section 3.4) λ_CLIP ∈ {100, 300, 500, 700, 1000}. For evaluation, we reported (1 − E_CLIP(x | t)) averaged over 256 samples. This metric quantifies whether the sampled images are consistent with the specified text t. Figure 5 plots models with pre-trained weights on FFHQ (Karras et al., 2019) (citations in Table 5, Appendix G). In Figure 6, we visualize samples for SN-DDPM and DDGAN trained on CelebA. We find that diffusion models outperform 2D/3D GANs for complex text, and different models represent the same sub-population differently. Broad coverage of individuals is an important aspect of the personalized use of generative models. To analyze this coverage, we guide different models to generate images that are close to a reference x_r in the identity (ID) space modeled by the IR-SE50 face embedding model (Deng et al., 2019), denoted as R.
Given an ID reference image x_r, we instantiated the energy defined in Section 3.4 as E_ID(x | x_r) = 1 − cos⟨R(x), R(x_r)⟩ with strength λ_ID = 2500 (i.e., λ_C in Section 3.4). For sampling, we used Langevin dynamics detailed in Eq. (10) with n = 200 and λ = 0.05. To measure ID similarity to the reference image x_r, we reported cos⟨R(x), R(x_r)⟩, averaged over 256 samples. In Table 4, we report the performance of StyleGAN2, StyleGAN-XL, GIRAFFE-HD, EG3D, LDM-DDIM, DDGAN, and DiffAE. DDGAN is trained on CelebA-HQ, while the others are trained on FFHQ. We find that diffusion models have much better coverage of individuals than 2D/3D GANs. Among diffusion models, deterministic LDM-DDIM (η = 0) achieves the best identity guidance performance. We provide image samples of identity guidance in Figure 9 (Appendix G).
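For reference, the two energy instantiations used in this section can be written down directly. The embedders below are toy stand-ins (slices of a vector) for the real CLIP encoders and the IR-SE50 embedder R, and the augmentation is a placeholder supplied by the caller; only the structure of the formulas is faithful to the text above.

```python
import numpy as np

def cos(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy stand-ins for the pretrained embedders: in practice CLIP_img/CLIP_text
# come from a CLIP checkpoint and R from the IR-SE50 face embedder.
def clip_img(x):  return x[:16]
def clip_text(t): return t[:16]
def face_R(x):    return x[16:]

def e_clip(x, t, diffaug, L=8):
    """E_CLIP(x|t): mean over L differentiable augmentations of 1 - cosine."""
    return sum(1.0 - cos(clip_img(diffaug(x)), clip_text(t))
               for _ in range(L)) / L

def e_id(x, x_ref):
    """E_ID(x|x_r) = 1 - cos<R(x), R(x_r)>."""
    return 1.0 - cos(face_R(x), face_R(x_ref))
```

Either energy plugs into the Langevin sampler of Eq. (10) unchanged; only the gradient of E with respect to the latent code (through G and the embedder) differs between the two guidance tasks.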

5. CONCLUSIONS AND DISCUSSION

This paper provides a unified view of pre-trained generative models by reformulating the latent space of diffusion models. While this reformulation is purely definitional, we show that it allows us to use diffusion models in similar ways as CycleGANs (Zhu et al., 2017) and GANs. Our CycleDiffusion achieves impressive performance on unpaired image-to-image translation (with two diffusion models trained on two domains independently) and zero-shot image-to-image translation (with text-to-image diffusion models). Our definition of the latent code also allows diffusion models to be guided in the same way as GANs (i.e., plug-and-play, without finetuning on noisy images), and results show that diffusion models have broader coverage of sub-populations and individuals than GANs. Beyond these results, it is worth noting that this paper raises more questions than it answers. We have provided a formal analysis of the common latent space of stochastic DPMs via the bounded distance between images (Section 3.3), but it still needs further study. Notably, Khrulkov & Oseledets (2022) and Su et al. (2022) studied deterministic DPMs based on optimal transport. Furthermore, efficient plug-and-play guidance for stochastic DPMs on high-resolution images with many diffusion steps remains open. These topics can be explored in future studies.



¹ Our codes will be publicly available.
² Quotation marks stand for "latent code" in the cited papers, different from our latent code z in Section 3.1.



Figure 2: Unpaired image-to-image translation (Cat → Dog, Wild → Dog) with CycleDiffusion.

Figure 3: CycleDiffusion for zero-shot image editing. Source images x are displayed with a purple margin; the other images are the generated x̂. Within each pair of source and target texts, overlapping text spans are marked in purple in the source text and abbreviated as [. . .] in the target text.

Figure 4: Cross Attention Control (CAC; Hertz et al., 2022) helps CycleDiffusion when the intended structural change is small. For instance, when the intended change is color but not shape (left), CAC helps CycleDiffusion preserve the background; when the intended change is horse → elephant, CAC makes the generated elephant look more like a horse in shape.

Figure 5: Unified plug-and-play guidance for diffusion models and GANs with text and CLIP. The text description used in each plot is a photo of [. . .]. Image samples and more analyses are in Figure 6 and Appendix G. When the guidance becomes complex, diffusion models surpass GANs.

Figure 6: Sampling sub-populations from pre-trained generative models. Notations follow Figure 5.

Table 2: Quantitative comparison of unpaired image-to-image translation methods. Methods in the second block use the same pre-trained diffusion model in the target domain. Results of CUT, ILVR, SDEdit, and EGSDE are from Zhao et al. (2022). Best results using diffusion models are in bold. CycleDiffusion has the best FID and KID among all methods and the best SSIM among methods with diffusion models. Note that it has been shown that SSIM is much better correlated with human visual perception than squared distance-based metrics such as L_2 and PSNR (Wang et al., 2004).

Table 3: Zero-shot image editing. We did not use fixed hyperparameters, nor did we plot the trade-off curve. The reason is that every input can have its own best hyperparameters and even random seed (such per-sample tuning for Stable Diffusion is quite common nowadays). Instead, for each input, we ran 15 random trials for each hyperparameter setting and reported the one with the highest S_D-CLIP.

