DUAL DIFFUSION IMPLICIT BRIDGES FOR IMAGE-TO-IMAGE TRANSLATION

Abstract

Common image-to-image translation methods rely on joint training over data from both source and target domains. The training process requires concurrent access to both datasets, which hinders data separation and privacy protection; and existing models cannot be easily adapted for translation of new domain pairs. We present Dual Diffusion Implicit Bridges (DDIBs), an image translation method based on diffusion models, that circumvents training on domain pairs. Image translation with DDIBs relies on two diffusion models trained independently on each domain, and is a two-step process: DDIBs first obtain latent encodings for source images with the source diffusion model, and then decode such encodings using the target model to construct target images. Both steps are defined via ordinary differential equations (ODEs), thus the process is cycle consistent only up to discretization errors of the ODE solvers. Theoretically, we interpret DDIBs as concatenation of source to latent, and latent to target Schrödinger Bridges, a form of entropy-regularized optimal transport, to explain the efficacy of the method. Experimentally, we apply DDIBs on synthetic and high-resolution image datasets, to demonstrate their utility in a wide variety of translation tasks and their inherent optimal transport properties.

1. INTRODUCTION

Transferring images from one domain to another while preserving the content representation is an important problem in computer vision, with wide applications that span style transfer (Xu et al., 2021; Sinha et al., 2021) and semantic segmentation (Li et al., 2020). In tasks such as style transfer, it is usually difficult to obtain paired images of realistic scenes and their artistic renditions. Consequently, unpaired translation methods are particularly relevant, since only the datasets, and not the one-to-one correspondence between image translation pairs, are required. Common methods for unpaired translation are based on generative adversarial networks (GANs; Goodfellow et al., 2014; Zhu et al., 2017) or normalizing flows (Grover et al., 2020). Training such models typically involves minimizing an adversarial loss between a specific pair of source and target datasets.

While capable of producing high-quality images, these methods suffer from a severe drawback in their adaptability to alternative domains. Concretely, a translation model on a source-target pair is trained specifically for this domain pair. Given a different pair, existing bespoke models cannot be easily adapted for translation. If we were to do pairwise translation among a set of domains, the total number of models needed would be quadratic in the number of domains, an unacceptable computational cost in practice. One alternative is to find a shared domain that connects to each source and target domain, as in StarGANs (Choi et al., 2018). However, the shared domain needs to be carefully chosen a priori; if the shared domain contains less information than the target domain (e.g., sketches vs. photos), then it creates an unwanted information bottleneck between the source and target domains.

An additional disadvantage of existing models resides in their lack of privacy protection of the datasets: training a translation model requires access to both datasets simultaneously.
Such a setting may be inconvenient or impossible when data providers are reluctant to give away their data, or for certain privacy-sensitive applications such as medical imaging. For example, routine hospital usage may require translation of patients' X-ray and MRI images taken from machines in other hospitals. Most existing methods will fail in such scenarios, as joint training requires aggregating confidential imaging data across hospitals, which may violate patients' privacy.

Figure 1: Dual Diffusion Implicit Bridges: DDIBs leverage two ODEs for image translation. Given a source image $x^{(s)}$, the source ODE runs in the forward direction to convert it to the latent $x^{(l)}$, while the target, reverse ODE then constructs the target image $x^{(t)}$. (Top) Illustration of the DDIBs idea between two one-dimensional distributions. (Bottom) DDIBs from a tiger to a cat using a pretrained conditional diffusion model.

In this paper, we seek to mitigate both problems of existing image translation methods. We present Dual Diffusion Implicit Bridges (DDIBs), an image-to-image translation method inspired by recent advances in diffusion models (Song et al., 2020a; b), that decouples paired training, and allows the domain-specific diffusion models to remain applicable in other pairs wherever the domain appears again as the source or the target. Since the training process now concentrates on one dataset at a time, DDIBs can also be applied in federated settings, and do not assume access to both datasets during model training. As a result, owners of domain data can effectively preserve their data privacy.

Specifically, DDIBs are developed based on the method known as denoising diffusion implicit models (DDIMs; Song et al., 2020a). DDIMs introduce a particular parameterization of the diffusion process that creates a smooth, deterministic and reversible mapping between images and their latent representations.
This mapping is captured using the solution to a so-called probability flow (PF) ordinary differential equation (ODE) that forms the cornerstone of DDIBs. Translation with DDIBs on a source-target pair requires two different PF ODEs: the source PF ODE converts input images to the latent space, while the target ODE then synthesizes images in the target domain. Crucially, trained diffusion models are specific to the individual domains, and rely on no domain pairing information. Effectively, DDIBs make it possible to save a trained model of a certain domain for future use, when it arises as the source or target in a new pair. Pairwise translation with DDIBs requires only a linear number of diffusion models (which can be further reduced with conditional models (Dhariwal & Nichol, 2021)), and training does not require scanning both datasets concurrently.

Theoretically, we analyze the DDIBs translation process to highlight two important properties. First, the probability flow ODEs in DDIBs, in essence, comprise the solution of a special Schrödinger Bridge Problem (SBP) with linear or degenerate drift (Chen et al., 2021a), between the data and the latent distributions. This optimal transport justification, which alternative translation methods lack, is a theoretical advantage of our method: DDIBs are the most OT-efficient translation procedure, while alternative methods may not be. Second, DDIBs guarantee exact cycle consistency: translating an image to and back from the target space reinstates the original image, up to discretization errors introduced in the ODE solvers.

Experimentally, we first present synthetic experiments on two-dimensional datasets to demonstrate DDIBs' cycle-consistency property. We then evaluate our method on a variety of image modalities, with qualitative and quantitative results: we validate its usage in example-guided color transfer, paired image translation, and conditional ImageNet translation.
These results establish DDIBs as a scalable, theoretically rigorous addition to the family of unpaired image translation methods.

2.1. SCORE-BASED GENERATIVE MODELS (SGMS)

While our actual implementation utilizes DDIMs, we first briefly introduce the broader family of models known as score-based generative models. Two representative models of this family are score matching with Langevin dynamics (SMLD) (Song & Ermon, 2019) and denoising diffusion probabilistic models (DDPMs) (Ho et al., 2020) . Both methods are contained within the framework of Stochastic Differential Equations (SDEs) proposed in Song et al. (2020b) . Song et al. (2020b) ; Anderson (1982) use a forward and a corresponding backward SDE to describe general diffusion and the reversed, generative processes:

Stochastic Differential Equation (SDE) Representation

$$dx = f(x, t)\,dt + g(t)\,dw, \qquad dx = \left[f - g^2 \nabla_x \log p_t(x)\right] dt + g(t)\,dw \tag{1}$$

where $w$ is the standard Wiener process, $f(x, t)$ is the vector-valued drift coefficient, $g(t)$ is the scalar diffusion coefficient, and $\nabla_x \log p_t(x)$ is the score function of the noise-perturbed data distribution (as defined by the forward SDE with initial condition $p_0(x)$ being the data distribution). At the endpoints $t \in \{0, 1\}$, the forward SDE in Eq. (1) admits the data distribution $p_0$ and the easy-to-sample prior $p_1$ as the boundary distributions. Within this framework, the SMLD method can be described using a Variance-Exploding (VE) SDE (Song et al., 2020a).

Probability Flow ODE

Any diffusion process can be represented by a deterministic ODE that carries the same marginal densities as the diffusion process throughout its trajectory. This ODE is termed the probability flow (PF) ODE (Song et al., 2020b). PF ODEs enable uniquely identifiable encodings (Song et al., 2020b) of data, and are central to DDIBs as we solve these ODEs for forward and reverse conversion between data and their latents. For the forward SDE introduced in Eq. (1), the equivalent PF ODE takes the following form:

$$dx = \left[f(x, t) - \frac{1}{2} g(t)^2 \nabla_x \log p_t(x)\right] dt \tag{2}$$

which follows immediately from the SDEs given the score function. In practice, we use $\theta$-parameterized score networks $s_{t,\theta} \approx \nabla_x \log p_t(x)$ to approximate the score function. Training such networks relies on a variational lower bound, described in Ho et al. (2020) and in Appendix B. We may then employ numerical ODE solvers to solve the above ODE and construct $x$ at different times. Empirically, it has been demonstrated that SGMs have relatively low discretization errors when reconstructing $x$ at $t = 0$ via ODE solvers (Song et al., 2020a). For conciseness, we use $v_\theta = dx/dt$ to denote the $\theta$-parameterized velocity field (as defined from Eq. (2), where we replace $\nabla_x \log p_t(x)$ with $s_{t,\theta}$), and use the symbol ODESolve to denote the mapping from $x(t_0)$ to $x(t_1)$:

$$\mathrm{ODESolve}(x(t_0); v_\theta, t_0, t_1) = x(t_0) + \int_{t_0}^{t_1} v_\theta(t, x(t))\,dt \tag{3}$$

which allows us to abstract away the exact model (be it a score-based or a diffusion model), or the integrator used. In our experiments, we implement the ODE solver in DDIMs (Song et al., 2020a) (Appendix B); other ODE solvers are also usable within our framework, such as the DPM-solver (Lu et al., 2022), the Exponential Integrator (Zhang & Chen, 2022), and the second-order Heun solver (Karras et al., 2022).
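To make the ODESolve abstraction in Eq. (3) concrete, the following is a minimal sketch and not the authors' implementation. It assumes a one-dimensional Gaussian data distribution so that the score is available in closed form, a VE SDE with the hypothetical schedule $\sigma(t) = \sigma_{\max} t$, and Heun's method as the integrator; the names `ode_solve` and `ve_gaussian_velocity` are illustrative only.

```python
import numpy as np

def ode_solve(x0, v, t0, t1, n_steps=500):
    """Approximate x(t1) for dx/dt = v(t, x) with x(t0) = x0, using Heun's method."""
    x = np.asarray(x0, dtype=float)
    h = (t1 - t0) / n_steps  # negative when integrating from t = 1 back to t = 0
    t = t0
    for _ in range(n_steps):
        k1 = v(t, x)
        k2 = v(t + h, x + h * k1)
        x = x + 0.5 * h * (k1 + k2)
        t += h
    return x

def ve_gaussian_velocity(mean, std, sigma_max=20.0):
    """PF ODE velocity (Eq. 2) for an assumed VE SDE: f = 0, g(t)^2 = d sigma(t)^2 / dt,
    sigma(t) = sigma_max * t, with data ~ N(mean, std^2) so the score is exact."""
    def v(t, x):
        g2 = 2.0 * sigma_max**2 * t            # g(t)^2
        var_t = std**2 + (sigma_max * t) ** 2  # variance of the marginal p_t
        score = -(x - mean) / var_t            # grad_x log p_t(x)
        return -0.5 * g2 * score               # f - (1/2) g^2 * score, with f = 0
    return v

x0 = np.array([-1.0, 0.5, 2.0])
v = ve_gaussian_velocity(mean=0.5, std=1.5)
latent = ode_solve(x0, v, 0.0, 1.0)     # data -> latent
recon = ode_solve(latent, v, 1.0, 0.0)  # latent -> data

# Closed-form PF ODE solution for this Gaussian toy problem, used only as a check.
exact = 0.5 + np.sqrt(1.5**2 + 20.0**2) / 1.5 * (x0 - 0.5)
```

In this toy setting the numerical latent should approximately match the closed-form solution, and integrating back from $t = 1$ to $t = 0$ should approximately recover `x0`; with a learned score network, `v` would wrap the network instead of the analytic score.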

2.2. SCHRÖDINGER BRIDGE PROBLEM (SBP)

Our analysis shows that DDIBs are Schrödinger Bridges (Chen et al., 2016; Léonard, 2013) between distributions. Let $\Omega = C([0, 1]; \mathbb{R}^n)$ be the path space of $\mathbb{R}^n$-valued continuous functions over the time interval $[0, 1]$, and let $D(p_0, p_1)$ be the set of distributions over $\Omega$ with marginals $p_0$, $p_1$ at times $t = 0$ and $t = 1$, respectively. Given a prior reference measure $W$ (footnote 1), the well-known Schrödinger Bridge Problem (SBP) seeks the most probable evolution across time $t$ between the marginals $p_0$ and $p_1$:

Problem 1 (Schrödinger Bridge Problem). With prescribed distributions $p_0$, $p_1$ and a reference measure $W$ as the prior, the SBP finds a distribution from $D(p_0, p_1)$ that minimizes its KL-divergence to $W$:

$$P_{SBP} := \arg\min \{ D_{KL}(P \,\|\, W) \mid P \in D(p_0, p_1) \}.$$

Algorithm 1 High-level Pseudo-code for DDIBs
Input: data sample from source domain $x^{(s)} \sim p_s(x)$, source model $v^{(s)}_\theta$, target model $v^{(t)}_\theta$.
Output: $x^{(t)}$, the result in the target domain.
    $x^{(l)} = \mathrm{ODESolve}(x^{(s)}; v^{(s)}_\theta, 0, 1)$   // obtain latent code from source domain data
    $x^{(t)} = \mathrm{ODESolve}(x^{(l)}; v^{(t)}_\theta, 1, 0)$   // obtain target domain data from latent code
    return $x^{(t)}$

The minimizer, $P_{SBP}$, is dubbed the Schrödinger Bridge between $p_0$ and $p_1$ over prior $W$. The SBP has connections to the Monge-Kantorovich (MK) optimal transport problem (Chen et al., 2021b). While the basic MK problem seeks the cost-minimizing plan to transport masses between distributions, the SBP incorporates an additional entropy term (for details, see Page 61 of Peyré et al. (2019)).

Relationship Between SBPs and SGMs

Chen et al. (2021a) establish connections between SGMs and SBPs. In summary, SGMs are implicit optimal transport models, corresponding to SBPs with linear or degenerate drifts. General SBPs additionally accept fully nonlinear diffusion.
To formalize this observation, the authors first establish similar forward and backward SDEs for SBPs:

$$dx = \left[f + g^2 \nabla_x \log \Phi_t(x)\right] dt + g(t)\,dw, \qquad dx = \left[f - g^2 \nabla_x \log \hat{\Phi}_t(x)\right] dt + g(t)\,dw \tag{4}$$

where $\Phi$, $\hat{\Phi}$ are the Schrödinger factors that satisfy the density factorization $p_t(x) = \Phi_t(x)\,\hat{\Phi}_t(x)$. The vector-valued quantities $z_t = g(t) \nabla_x \log \Phi_t(x)$ and $\hat{z}_t = g(t) \nabla_x \log \hat{\Phi}_t(x)$ fully characterize the dynamics of the SBP, and can thus be considered as the forward and backward "policies", analogous to policy-based methods described in Schulman et al. (2015); Pereira et al. (2019). To draw a link between SBPs and SGMs, the data log-likelihood objective for SBPs is computed and shown to be equal to that of SGMs with special choices of $z_t$, $\hat{z}_t$ (derivation details in Chen et al. (2021a)). Importantly, likelihood equality occurs with the following policies:

$$(z_t, \hat{z}_t) = (0,\; g(t) \nabla_x \log p_t(x)) \tag{5}$$

When the marginal $p_1$ at time $t = 1$ is equal to the prior distribution, it is known that such $(z_t, \hat{z}_t)$ are achieved. Since in SGMs the end marginal $p_1$ is indeed the standard Gaussian prior, their log-likelihood is equivalent to that of SBPs. This suggests that SGMs are a special case of SBPs with degenerate forward policy $z_t$ and a multiple of the score function as the backward policy $\hat{z}_t$.

Probability Flow ODE

In a similar vein to the SGM SDEs, a deterministic PF ODE can be derived for SBPs with identical marginal densities across $t \in [0, 1]$. The following PF ODE specifies the probability flow of the optimal processes of the SBP defined in Eq. (4) (Chen et al., 2021a):

$$dx = \left[f(x, t) + g(t)\,z - \frac{1}{2} g(t)(z + \hat{z})\right] dt \tag{6}$$

where $z$ depends on $x$. We shall show that the PF ODEs for SGMs and SBPs are equivalent. Thus, flowing through the PF ODEs in DDIBs is equivalent to flowing through special Schrödinger Bridges, with one of the marginals being Gaussian.
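As a quick numerical sanity check of this equivalence, the sketch below evaluates the drift of the SBP PF ODE in Eq. (6) under the policies of Eq. (5) and compares it with the drift of the SGM PF ODE in Eq. (2). The VP-type drift, the constant `beta`, and the Gaussian marginal are toy assumptions chosen only so that every quantity is available in closed form; the function names are illustrative.

```python
import numpy as np

def sgm_pf_drift(f, g, score):
    """Drift of the SGM probability flow ODE, Eq. (2): f - (1/2) g^2 * score."""
    return f - 0.5 * g**2 * score

def sbp_pf_drift(f, g, z, z_hat):
    """Drift of the SBP probability flow ODE, Eq. (6): f + g z - (1/2) g (z + z_hat)."""
    return f + g * z - 0.5 * g * (z + z_hat)

# Toy values (assumptions): VP-type drift and a Gaussian marginal p_t = N(0.3, 0.8).
x = np.linspace(-3.0, 3.0, 7)
beta = 2.0
f = -0.5 * beta * x
g = np.sqrt(beta)
score = -(x - 0.3) / 0.8

# Policies from Eq. (5): degenerate forward policy, scaled score as backward policy.
z, z_hat = np.zeros_like(x), g * score
gap = np.max(np.abs(sbp_pf_drift(f, g, z, z_hat) - sgm_pf_drift(f, g, score)))
```

The value of `gap` should be zero up to floating-point rounding, reflecting that the two drifts coincide under these policies.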

3. DUAL DIFFUSION IMPLICIT BRIDGES

DDIBs leverage the connections between SGMs and SBPs to perform image-to-image translation, with two diffusion models trained separately on the two domains. DDIBs contain two steps, described in Alg. 1 and illustrated in Fig. 1. At the core of the algorithm is the ODE solver ODESolve from Eq. (3). Given a source model represented as a vector field, i.e., $v^{(s)}_\theta$ defined from Eq. (2), DDIBs first apply ODESolve in the source domain to obtain the encoding $x^{(l)}$ of the image at the end time $t = 1$; we refer to this as the latent code (associated with the diffusion model for the domain). Then, the source latent code is fed as the initial condition (target latent code at $t = 1$) to ODESolve with the target model $v^{(t)}_\theta$ to obtain the target image $x^{(t)}$. As discussed earlier, we implement ODESolve with DDIMs (Song et al., 2020a), which are known to have reasonably small discretization errors. While recent higher-order ODE solvers (Zhang & Chen, 2022; Lu et al., 2022; Karras et al., 2022) that generalize DDIMs can also be used here, we leave this investigation to future work. Despite the simplicity of the method, DDIBs have several advantages over prior methods, which we discuss below.

Exact Cycle Consistency

A desirable feature of image translation algorithms is the cycle consistency property: transforming a data point from the source domain to the target domain, and then back to the source, recovers the original data point in the source domain. The following proposition validates the cycle consistency of DDIBs.

Proposition 3.1 (DDIBs Enforce Exact Cycle Consistency). Given a sample from the source domain $x^{(s)}$, a source diffusion model $v^{(s)}_\theta$, and a target model $v^{(t)}_\theta$, define:

$$x^{(l)} = \mathrm{ODESolve}(x^{(s)}; v^{(s)}_\theta, 0, 1); \qquad x^{(t)} = \mathrm{ODESolve}(x^{(l)}; v^{(t)}_\theta, 1, 0); \tag{7}$$

$$x'^{(l)} = \mathrm{ODESolve}(x^{(t)}; v^{(t)}_\theta, 0, 1); \qquad x'^{(s)} = \mathrm{ODESolve}(x'^{(l)}; v^{(s)}_\theta, 1, 0)$$

Assume zero discretization error. Then, $x^{(s)} = x'^{(s)}$.
As PF ODEs are used, the cycle consistency property is guaranteed. In practice, even with discretization error, DDIBs incur almost negligible cycle inconsistency (Section 4.1). In contrast, GAN-based methods are not guaranteed the cycle consistency property by default, and have to incorporate additional training terms to optimize for cycle consistency over two domains.
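The two-step translation of Alg. 1 and the round trip of Proposition 3.1 can be sketched on a toy problem as follows. This is an illustrative sketch under stated assumptions, not the authors' code: both domains are one-dimensional Gaussians with analytic scores, the diffusion is a VP SDE with a constant, hypothetical `beta`, and the integrator is Heun's method rather than the DDIM solver used in the paper. All function names are made up for this example.

```python
import numpy as np

def ode_solve(x0, v, t0, t1, n_steps=500):
    """Heun integration of dx/dt = v(t, x) from t0 to t1 (t1 < t0 runs in reverse)."""
    x = np.asarray(x0, dtype=float)
    h = (t1 - t0) / n_steps
    t = t0
    for _ in range(n_steps):
        k1 = v(t, x)
        k2 = v(t + h, x + h * k1)
        x = x + 0.5 * h * (k1 + k2)
        t += h
    return x

def vp_gaussian_velocity(mean, std, beta=8.0):
    """PF ODE velocity for an assumed VP SDE (f = -beta x / 2, g^2 = beta)
    when the domain's data distribution is N(mean, std^2)."""
    def v(t, x):
        decay = np.exp(-beta * t)
        mu_t = mean * np.sqrt(decay)            # mean of the marginal p_t
        var_t = std**2 * decay + (1.0 - decay)  # variance of the marginal p_t
        score = -(x - mu_t) / var_t
        return -0.5 * beta * x - 0.5 * beta * score
    return v

def ddib_translate(x_src, v_src, v_tgt):
    """Alg. 1: encode with the source model, then decode with the target model."""
    latent = ode_solve(x_src, v_src, 0.0, 1.0)
    return ode_solve(latent, v_tgt, 1.0, 0.0)

v_s = vp_gaussian_velocity(mean=-2.0, std=0.5)  # "source domain" model
v_t = vp_gaussian_velocity(mean=3.0, std=2.0)   # "target domain" model, defined independently

x_s = np.array([-2.5, -2.0, -1.0])
x_t = ddib_translate(x_s, v_s, v_t)      # source -> latent -> target
x_back = ddib_translate(x_t, v_t, v_s)   # target -> latent -> source
cycle_error = np.max(np.abs(x_back - x_s))
```

In this toy case `cycle_error` is nonzero only because of discretization, and `x_t` preserves the ordering of `x_s` and lies close to the monotone map $3 + 4(x + 2)$ between the two Gaussians; the small remaining offset arises because, with a finite `beta`, the latent at $t = 1$ is only approximately standard normal.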

Data Privacy in Both Domains

In the DDIBs translation process, only the source and target diffusion models are required, whose training processes do not depend on knowledge of the domain pair a priori. In fact, this process can even be performed in a privacy sensitive manner (graphic illustration in Appendix A). Let Alice and Bob be the data owners of the source and target domains, respectively. Suppose Alice intends to translate images to the target domain. However, Alice does not want to share the data with Bob (and vice versa, Bob does not want to release their data either). Then, Alice can simply train a diffusion model with the source data, encode the data to the latent space, transmit the latent codes to Bob, and next ask Bob to run their trained diffusion model and send the results back. In this procedure, only the latent code and the target results are transmitted between the two data vendors, and both parties have naturally ensured that their data are not directly revealed. DDIBs are Two Concatenated Schrödinger Bridges DDIBs link the source data distribution to the latent space, and then to the target distribution. What is the nature of such connections between distributions? We offer an answer from an optimal transport perspective: these connections are special Schrödinger Bridges between distributions. This, in turn, explicates the name of our method: dual diffusion implicit bridges are based on denoising diffusion implicit models (Song et al., 2020a) , and consist of two separate Schrödinger Bridges that connect the data and latent distributions. Specifically, as considered earlier, when conditions about the policies z t , ẑt in Eq. ( 5) and the density p 1 (x) being a Gaussian prior are met, the data likelihoods (at t = 0) for SGMs and SBPs are identical. Indeed, these conditions are fulfilled in SGMs and particularly in DDIMs. This verifies SGMs as special linear or degenerate SBPs. 
Forward and reverse solving the PF ODE for SGMs, as done in DDIBs, is equivalent to flowing through the optimal processes of particular SBPs:

Proposition 3.2 (PF ODE Equivalence; footnote 2). Eq. (2) is equivalent to Eq. (6) with forward, backward policies $(z_t, \hat{z}_t) = (0, g \nabla_x \log p_t(x))$, as attained in SGMs and particularly in DDIMs.

Thus, DDIBs are intrinsically entropy-regularized optimal transport: they are Schrödinger Bridges between the source and the latent, and between the latent and the target distributions. The translation process can then be recognized as traversing two concatenated Schrödinger Bridges, one forward and one reversed. The mapping is unique and minimizes a (regularized) optimal transport objective, which likely explains the superior performance of DDIBs. In contrast, if we train the source and target models separately with normalizing flow models that do not inherently carry such a connection, there are many viable invertible mappings, and the resulting image translation algorithm may not necessarily perform well. This is probably the reason why AlignFlow (Grover et al., 2020) still has to incorporate an adversarial loss even when cycle-consistency is guaranteed.

We first perform domain translation on synthetic datasets drawn from complex two-dimensional distributions, with various shapes and configurations, in Fig. 2a. In total, we consider six 2D datasets: Moons (M); Checkerboards (CB); Concentric Rings (CR); Concentric Squares (CS); Parallel Rings (PR); and Parallel Squares (PS). The datasets are all normalized to have zero mean and identity covariance. We assign colors to points based on the point identities (i.e., if a point in the source domain is red, its corresponding point in the target domain is also colored red). Clearly, the transformation is smooth between columns.
For example, on the top-right corner, red points in the CR dataset are mapped to similar coordinates, both in the latent and in the target dimensions.

Comparison to Alternative OT Methods

As DDIBs are related to regularized OT, we compare the pixel-wise MSEs between color-transferred images generated by DDIBs, and images produced by alternative methods, in Table 2. We include four OT methods for comparison: Earth Mover's Distance; Sinkhorn distance (Cuturi, 2013); and linear and Gaussian mapping estimation (Perrot et al., 2016). Results of DDIBs are very close to those of OT methods. Appendix E.2 details full color translation results.

Paired Domain Translation

As in similar works, we evaluate DDIBs on benchmark paired datasets (Zhu et al., 2017): Facades and Maps. Both are image segmentation tasks. In each pair of datasets, one dataset contains real photos taken via a camera or a satellite, while the other comprises the corresponding segmentation images. These datasets provide one-to-one image alignment, which allows quantitative evaluation through a distance metric such as mean-squared error (MSE) between generated samples and the corresponding ground truth. To facilitate the workings of DDIBs, we additionally employ a color conversion heuristic motivated by optimal transport on image colors (Appendix E.1). Table 3 reports the evaluation results. Surprisingly, DDIBs are able to produce segmentation images that surpass alternative methods in MSE terms, while reverse translations also achieve decent performance.

4.4. CLASS-CONDITIONAL IMAGENET TRANSLATION

In this experiment, we apply DDIBs to translation among ImageNet classes. To this end, we leverage the pretrained diffusion models from Dhariwal & Nichol (2021). The authors optimized the performance of diffusion models and arrived at a "UNet" (Ho et al., 2020) architecture with particular width, attention and residual configurations. The models are learned on 1,000 ImageNet classes, each with around 1,000 training images, and at a variety of resolutions. Our experiments use the model with resolution 256 × 256. Moreover, these models incorporate a technique known as classifier guidance (Dhariwal & Nichol, 2021), which leverages classifier gradients to steer the sampling process towards arbitrary class labels during image generation. The learned models combined with classifier guidance can be effectively considered as 1,000 different models. Fig. 4a exhibits select translation samples, where the source images are from the ImageNet validation set. DDIBs are able to create faithful target images that maintain much of the original content such as animal poses, complexions and emotions, while accounting for differences in animal species.

Multi-Domain Translation

Given conditional models on the individual domains, DDIBs can be applied to translate between arbitrary pairs of source-target domains, while requiring no additional fine-tuning or adaptation.

5. RELATED WORKS

Score-based Diffusion Models Originating in thermodynamics (Sohl-Dickstein et al., 2015) , diffusion models reverse the dynamics of a noising process to create data samples. The reversal process is understood to implicitly compute scores of the data density at various noise scales, which reveals connections to score-based methods (Song & Ermon, 2019; Nichol & Dhariwal, 2021; Meng et al., 2021b) . Diffusion models are applicable to multiple modalities: 3D shapes (Zhou et al., 2021) , point cloud (Luo & Hu, 2021) , discrete domains (Meng et al., 2022) and function spaces (Lim et al., 2023) . They excel in tasks ranging from image editing and composition (Meng et al., 2021a) , density estimation (Kingma et al., 2021) , to image restoration (Kawar et al., 2022) . Seminal works are denoising diffusion probabilistic models (DDPMs, Ho et al. (2020) ), which parameterized the ELBO objective with Gaussians and, for the first time, synthesized high-quality images with diffusion models; ILVR (Choi et al., 2021) , which invented a novel conditional method to direct DDPM generation towards reference images; and denoising diffusion implicit models (DDIMs, Song et al. (2020a) ), which accelerated DDPM inference via non-Markovian processes. DDIMs can be treated as a first-order numerical solver of a probabilistic ODE, which we use heavily in DDIBs. Diffusion Models for Image Translation While GANs (Goodfellow et al., 2014; Zhu et al., 2017; Zhao et al., 2020) have been widely adopted in image translation tasks, recent works increasingly leverage diffusion models. For instance, Palette (Saharia et al., 2021) applies a conditional diffusion model to colorization, inpainting, and restoration. DiffuseIT (Kwon & Ye, 2022) utilizes disentangled style and content representation, to perform text-and image-guided style transfer. Lastly, UNIT-DDPM (Sasaki et al., 2021) proposes a novel coupling between domain pairs and trains joint DDPMs for translation. 
Unlike their joint training, DDIBs apply separate, pretrained diffusion models and leverage geometry of the shared space for translation. Optimal Transport for Translation and Generative Modeling As it pursues cost-optimal plans to connect image distributions, OT naturally finds applications in image translation. For example, Korotin et al. (2022) capitalizes on the approximation powers of neural networks to compute OT plans between image distributions and perform unpaired translation. By contrast, the entropy-regularized OT variant, Schrödinger Bridges (Section 2), are also commonly used to derive generative models. For instance, De Bortoli et al. ( 2021) and Vargas et al. (2021) concurrently proposed new numerical procedures that approximate the Iterative Proportional Fitting scheme, to solve SBPs for image generation. Wang et al. (2021) presents a new generative method via entropic interpolation with an SBP. Chen et al. (2021a) discovers equivalence between the likelihood objectives of SBP and score-based models, which lays the theoretical foundations behind DDIBs. Their sequel (Liu et al., 2023) then directly learns the Schrödinger Bridges between image distributions, for applications in image-to-image tasks such as restoration. While DDIBs were not initially designed to mimic Schrödinger Bridges, our analysis reveals their true characterization as solutions to degenerate SBPs.

6. CONCLUSIONS

We present Dual Diffusion Implicit Bridges (DDIBs), a new, simple image translation method that stems from recent progress in score-based diffusion models, and is theoretically grounded as Schrödinger Bridges in the image space. DDIBs solve two key problems. First, DDIBs avoid optimization on a coupled loss specific to the given domain pair only. Second, DDIBs better safeguard dataset privacy as they no longer require the presence of both datasets during training. Powerful pretrained diffusion models are then integrated into our DDIBs framework, to perform a comprehensive series of experiments that demonstrate DDIBs' practical value in domain translation. Our method is limited in its application to color transfer, as one model is required for each image, which demands significant compute for mass experiments. Rooted in optimal transport, DDIBs translation mimics the mass-moving process, which may be problematic at times (Appendix C). Future work may remedy these issues, or extend DDIBs to applications with different dimensions in the source and target domains. As flowing through the concatenated ODEs is time-consuming, improving the translation speed is also a promising direction.

A ILLUSTRATION: PRIVACY-SENSITIVE TRANSLATION

Figure 5

Alice is the owner of the source (tiger) domain, and Bob is the owner of the target (cat) domain. Alice intends to translate tiger images to cat images, but in a privacy-sensitive manner without releasing the source dataset. Bob does not wish to make the cat dataset public, either. Fig. 5 illustrates the process of privacy-sensitive domain translation. The process contains the following steps, with indexes in the figure.

1. Alice intends to translate tiger images to cat images.
2. Alice trains a diffusion model with the source tiger images.
3. Alice uses the pretrained tiger diffusion model to convert a source tiger image to its latent code.
4. Alice sends the latent code to Bob.
5. Bob similarly trains a diffusion model on the cat domain.
6. Bob uses the pretrained cat diffusion model to convert the received latent code to a cat image.
7. Bob then sends the translated image back to Alice.

Clearly, during the translation process, only the latent code and the translated cat image are transmitted via the public channel, while both source and target datasets remain private to the two parties. This is a significant advantage of DDIBs over alternative methods, as we enable strong privacy protection of the datasets.

The resulting noise prediction functions $\epsilon^{(t)}_\theta$ are equivalent to the score networks $s_{t,\theta}$ mentioned in Section 2 due to Tweedie's formula (Stein, 1981; Efron, 2011). For details, we refer the reader to Ho et al. (2020); Song et al. (2020a).

B.2 DDIM ODE SOLVER

With a trained noise prediction model $\epsilon^{(t)}_\theta(x)$, the DDIM iterate between adjacent variables $x_{t-\Delta t}$ and $x_t$, considered in Song et al. (2020a), assumes the following form:

$$\frac{x_{t-\Delta t}}{\sqrt{\alpha_{t-\Delta t}}} = \frac{x_t}{\sqrt{\alpha_t}} + \left(\sqrt{\frac{1-\alpha_{t-\Delta t}}{\alpha_{t-\Delta t}}} - \sqrt{\frac{1-\alpha_t}{\alpha_t}}\right) \epsilon^{(t)}_\theta(x_t)$$

In our experiments, we implement the above equation between adjacent diffusion steps. The equation is deterministic, and can be considered as an Euler method over the following ODE:

$$d\bar{x}(t) = \epsilon^{(t)}_\theta\!\left(\frac{\bar{x}(t)}{\sqrt{\sigma^2 + 1}}\right) d\sigma(t) \tag{9}$$

where we adopt the reparameterization:

$$\sigma(t) = \sqrt{\frac{1-\alpha(t)}{\alpha(t)}}, \qquad \bar{x}(t) = \frac{x(t)}{\sqrt{\alpha(t)}}$$

Importantly, the ODE in Eq. (9) with the optimal model $\epsilon^{(t)}_\theta(x)$ has an equivalent probability flow ODE corresponding to the "Variance-Exploding" SDE in Song et al. (2020b).
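The DDIM iterate above can be sketched in a few lines. The example below is a hedged illustration rather than the released implementation: it replaces the trained network with the exact noise predictor for a one-dimensional Gaussian data distribution, passes $\alpha$ instead of a timestep to the model, and uses a hypothetical schedule that is uniform in $\sigma$; `ddim_step`, `ddim_solve`, and `gaussian_eps_model` are names invented for this sketch.

```python
import numpy as np

def ddim_step(x, alpha_from, alpha_to, eps):
    """One deterministic DDIM update between two noise levels (either direction)."""
    sigma_from = np.sqrt((1.0 - alpha_from) / alpha_from)
    sigma_to = np.sqrt((1.0 - alpha_to) / alpha_to)
    x_bar = x / np.sqrt(alpha_from) + (sigma_to - sigma_from) * eps
    return np.sqrt(alpha_to) * x_bar

def ddim_solve(x, alphas, eps_model):
    """Apply ddim_step along a schedule; alphas[i] is the alpha at the i-th visited level."""
    x = np.asarray(x, dtype=float)
    for a_from, a_to in zip(alphas[:-1], alphas[1:]):
        x = ddim_step(x, a_from, a_to, eps_model(x, a_from))
    return x

def gaussian_eps_model(mean, std):
    """Exact noise predictor E[eps | x_t] when the data distribution is N(mean, std^2)
    (stand-in for a trained network; an assumption of this sketch)."""
    def eps(x, alpha):
        var = alpha * std**2 + (1.0 - alpha)
        return np.sqrt(1.0 - alpha) * (x - np.sqrt(alpha) * mean) / var
    return eps

def round_trip_error(x0, eps_model, n_steps, sigma_max=20.0):
    """Encode data to the latent and decode back; return the max reconstruction error."""
    sigmas = np.linspace(0.0, sigma_max, n_steps + 1)  # assumed schedule, uniform in sigma
    alphas = 1.0 / (1.0 + sigmas**2)                   # inverts sigma = sqrt((1 - alpha) / alpha)
    latent = ddim_solve(x0, alphas, eps_model)         # alpha: 1 -> small (data to latent)
    recon = ddim_solve(latent, alphas[::-1], eps_model)
    return np.max(np.abs(recon - x0))

x0 = np.array([-1.0, 0.5, 2.0])
eps_model = gaussian_eps_model(mean=0.5, std=1.5)
err_coarse = round_trip_error(x0, eps_model, n_steps=100)
err_fine = round_trip_error(x0, eps_model, n_steps=1000)
```

Because the update is a first-order (Euler) step in $\sigma$, the reconstruction error in this toy setting shrinks as the number of steps grows, which is the sense in which cycle consistency holds up to discretization error.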

C LIMITATIONS OF OPTIMAL TRANSPORT-BASED TRANSLATION

DDIBs contain deterministic bridges between distributions, and are a form of entropy-regularized optimal transport. The learned diffusion models can be effectively considered as a digest or summary of the datasets. During translation, they attempt to create images in the target domain that are closest in optimal transport distance to the source images. This OT-based process is both an advantage and a limitation of our method. In ImageNet translation, when the source and target datasets are similar, DDIBs are generally able to identify correct animal postures. For example, we have shouting lions and tigers, because these animals have similar behaviors that are observed in the datasets and then internalized by DDIBs. However, in datasets that are less similar (e.g., birds and dogs), DDIBs sometimes fail to produce translation results that retain the postures precisely. We encountered significantly fewer such cases in AFHQ translation, since the dataset is more standardized and homogeneous.

Fig. 6 illustrates the optimal transport mappings among images as well as some failure cases. Clearly, the translation processes flowing from left to right minimize the Euclidean transportation distances between images. Some of these translated samples may be classified as "failure cases" in actual user studies. We consider such behavior both a feature and a limitation of DDIBs.

Proof of Proposition 3.2. The proof proceeds by substituting the values of $(z_t, \hat{z}_t) = (0, g(t) \nabla_x \log p_t(x))$ into Eq. (6):

$$\begin{aligned} dx &= \left[f(x, t) + g(t)\,z - \frac{1}{2} g(t)(z + \hat{z})\right] dt \\ &= \left[f(x, t) - \frac{1}{2} g(t)^2 \nabla_x \log p_t(x)\right] dt \end{aligned} \tag{10}$$

This is exactly Eq. (2).



In our application, the reference measure is set to the measure of Eq. (1), as per Chen et al. (2021a).
Proof in Appendix D.
Project: https://suxuann.github.io/ddib/
Code: https://github.com/suxuann/ddib/



(a) Smooth translation of synthetic datasets. (Left) The source datasets: CR and CS. (Middle) DDIBs' latent code representation. (Right) Results of translation to the target domains. (b) Cycle consistency: After translating the Moons dataset to Checkerboards and then back to Moons, DDIBs restore almost the exact same points as the original ones.

Figure 2: Smoothness and cycle consistency of DDIBs.

4.2 EXAMPLE-GUIDED COLOR TRANSFER

DDIBs can be applied to an interesting task: example-guided color transfer. This refers to modifying the colors of an input image, conditioned on the color palette of a reference image. To use DDIBs for color transfer, we train one diffusion model per image, on its normalized RGB space. During translation, DDIBs obtain encodings of the original colors, and apply the diffusion model of the reference image to attain the desired color palette. Fig. 3 visualizes our color experiments.

Figure 3: Example-Guided Color Transfer: Given the first image as the reference image, DDIBs modify the colors of two input images to similarly follow a snowy winter color palette.

Fig. 4b displays results of translating a common image of a roaring lion (with class label 291) to various other ImageNet classes. Interestingly, some animals roar, while others stick their tongues out. DDIBs successfully internalize characteristics of distinct animal species, and produce the animal postures closest in OT distances to the original roaring lion.

(a) Conditional ImageNet Translation: Selected translation samples from various ImageNet classes such as 7: Cock, 94: Hummingbird, 162: Beagle, and 282: Tiger Cat. (b) Multi-domain translation: Given the center source image from class label 291, DDIBs translate it to other animal species, using only a pretrained conditional diffusion model.

Figure 4: Translation among ImageNet classes.

Figure 6: Optimal transport translation processes in DDIBs. (Leftmost) Source images. (Rightmost) Translated images.

Cycle consistency of DDIBs. Experiment legend, PR ⟳ PS, means that we translate from PR to PS and then back. The numbers are the averaged L2 distances between the original points and their coordinates after cycle translation. Data points are standardized to have unit variance.

Mean Squared Error (MSE) comparing color transfer results of DDIBs with common OT methods on two images. Each number represents the MSE between DDIBs and the corresponding OT method. MSE is computed pixel-wise after normalizing images to [-1, 1].

MSE comparing DDIBs and baselines on paired test sets. MSE is computed pixel-wise after normalizing images to [-1, 1].

ACKNOWLEDGEMENTS

We thank Lingxiao Li and Chris Cundy for insightful discussions about the optimal transport properties of DDIBs. We also thank the anonymous reviewers for their constructive comments and feedback. This research was supported by NSF (#1651565), ARO (W911NF-21-1-0125), ONR (N00014-23-1-2159), CZ Biohub, and Stanford HAI.

E ADDITIONAL EXPERIMENTAL DETAILS

E.1 OPTIMAL TRANSPORT IN PAIRED DATASETS

Color Conversion In Fig. 7, a simple examination of the original and segmentation images reveals significant differences in color configurations. In the Maps dataset, while the real satellite images are composed of dark colors, the segmentation images are light-toned. The same observation applies to other datasets. The sharp contrasts in colors intuitively present a large transportation cost, which probably hinders the progress of DDIBs, given their relationship to OT demonstrated in Section 3. To facilitate the workings of DDIBs, we follow a heuristic to transform the colors of the segmentation images. Specifically, on a small subset of the training dataset, we run an OT algorithm to compute a color correspondence that minimizes the color differences, in terms of Sinkhorn distances, between the real and segmentation images. The segmentation (target) datasets undergo this color conversion before they are fed into a diffusion model for training. During evaluation, when we compute MSEs, the images are converted back to the original color space.

Privacy Protection Color conversion requires considering both datasets jointly to compute a color mapping, and appears to conflict with the original purpose of DDIBs of protecting dataset privacy. We note that the amount of leaked information is minimal: for example, to compute a color correspondence for the Maps dataset, we sampled only around 1000 pixels from the two datasets to summarize the color composition information. DDIBs still largely preserve privacy.
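The color conversion heuristic described under Color Conversion can be sketched as follows. This is an illustrative reconstruction under stated assumptions, not the exact procedure used in our experiments: it assumes pixels are sampled as RGB triples in [0, 1], uses a plain NumPy Sinkhorn iteration with squared Euclidean cost, and maps each sampled segmentation color to the barycenter of the real-image colors it is coupled with. The function names `sinkhorn_plan` and `color_correspondence`, the regularization strength `reg`, and the iteration count are hypothetical choices.

```python
import numpy as np

def sinkhorn_plan(a, b, cost, reg=0.05, n_iters=200):
    """Entropy-regularized OT plan between histograms a and b (Sinkhorn iterations)."""
    K = np.exp(-cost / reg)          # Gibbs kernel
    u = np.ones_like(a)
    for _ in range(n_iters):
        v = b / (K.T @ u)            # rescale to match column marginals
        u = a / (K @ v)              # rescale to match row marginals
    return u[:, None] * K * v[None, :]

def color_correspondence(seg_pixels, real_pixels, reg=0.05, n_iters=200):
    """Map each sampled segmentation color toward the real-image palette.

    seg_pixels:  (n, 3) array of RGB values in [0, 1] from segmentation images.
    real_pixels: (m, 3) array of RGB values in [0, 1] from real images.
    Returns the (n, 3) mapped colors and the (n, m) transport plan.
    """
    n, m = len(seg_pixels), len(real_pixels)
    a = np.full(n, 1.0 / n)          # uniform weights on sampled pixels
    b = np.full(m, 1.0 / m)
    diff = seg_pixels[:, None, :] - real_pixels[None, :, :]
    cost = (diff ** 2).sum(axis=-1)  # squared Euclidean color distance
    plan = sinkhorn_plan(a, b, cost, reg, n_iters)
    # Barycentric projection: each mapped color is a convex combination
    # of real-image colors, weighted by the corresponding row of the plan.
    mapped = (plan @ real_pixels) / plan.sum(axis=1, keepdims=True)
    return mapped, plan
```

Because each mapped color is a convex combination of sampled real-image colors, the converted palette stays within the range of the real images. Only the small pixel samples, not the full datasets, enter this computation.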

