DUAL DIFFUSION IMPLICIT BRIDGES FOR IMAGE-TO-IMAGE TRANSLATION

Abstract

Common image-to-image translation methods rely on joint training over data from both source and target domains. The training process requires concurrent access to both datasets, which hinders data separation and privacy protection; moreover, existing models cannot easily be adapted for the translation of new domain pairs. We present Dual Diffusion Implicit Bridges (DDIBs), an image translation method based on diffusion models that circumvents training on domain pairs. Image translation with DDIBs relies on two diffusion models trained independently on each domain, and is a two-step process: DDIBs first obtain latent encodings for source images with the source diffusion model, and then decode such encodings using the target model to construct target images. Both steps are defined via ordinary differential equations (ODEs), so the process is cycle consistent up to discretization errors of the ODE solvers. Theoretically, we interpret DDIBs as the concatenation of source-to-latent and latent-to-target Schrödinger bridges, a form of entropy-regularized optimal transport, to explain the efficacy of the method. Experimentally, we apply DDIBs to synthetic and high-resolution image datasets to demonstrate their utility in a wide variety of translation tasks, as well as their inherent optimal transport properties.

1. INTRODUCTION

Transferring images from one domain to another while preserving the content representation is an important problem in computer vision, with wide applications that span style transfer (Xu et al., 2021; Sinha et al., 2021) and semantic segmentation (Li et al., 2020). In tasks such as style transfer, it is usually difficult to obtain paired images of realistic scenes and their artistic renditions. Consequently, unpaired translation methods are particularly relevant, since only the datasets, and not a one-to-one correspondence between image translation pairs, are required. Common methods for unpaired translation are based on generative adversarial networks (GANs, Goodfellow et al. (2014); Zhu et al. (2017)) or normalizing flows (Grover et al., 2020). Training such models typically involves minimizing an adversarial loss between a specific pair of source and target datasets. While capable of producing high-quality images, these methods suffer from a severe drawback: they do not adapt to alternative domains. Concretely, a translation model is trained specifically for one source-target pair; given a different pair, existing, bespoke models cannot easily be reused. Pairwise translation among a set of domains thus requires a number of models quadratic in the number of domains, an unacceptable computational cost in practice. One alternative is to find a shared domain that connects to each source and target domain, as in StarGANs (Choi et al., 2018). However, the shared domain needs to be carefully chosen a priori; if it contains less information than the target domain (e.g., sketches vs. photos), it creates an unwanted information bottleneck between the source and target domains. An additional disadvantage of existing models is their lack of privacy protection for the datasets: training a translation model requires simultaneous access to both datasets.
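To make the model-count argument concrete, here is a quick sketch (plain Python, with a hypothetical count of N = 10 domains) comparing the number of pair-specific models needed for full pairwise translation against the number of per-domain diffusion models that DDIBs require:

```python
# For N domains, pair-specific translation models (e.g., one GAN-style
# model per ordered source-target pair) grow quadratically, while DDIBs
# need only one diffusion model per domain.
N = 10                           # hypothetical number of domains
pairwise_models = N * (N - 1)    # one model per ordered (source, target) pair
ddib_models = N                  # one diffusion model per domain suffices

print(pairwise_models, ddib_models)  # 90 10
```

Already at ten domains the pairwise approach needs 90 bespoke models versus 10 reusable ones, and the gap widens quadratically.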
Such a setting may be inconvenient or impossible when data providers are reluctant to share their data, or in privacy-sensitive applications such as medical imaging. For example, routine hospital usage may require translation of patients' X-ray and MRI images taken from machines in other hospitals. Most existing methods fail in such scenarios, as joint training requires aggregating confidential imaging data across hospitals, which may violate patients' privacy. In this paper, we seek to mitigate both problems of existing image translation methods. We present Dual Diffusion Implicit Bridges (DDIBs), an image-to-image translation method inspired by recent advances in diffusion models (Song et al., 2020a;b), that decouples paired training and allows each domain-specific diffusion model to be reused whenever its domain appears again as the source or the target. Since the training process concentrates on one dataset at a time, DDIBs can also be applied in federated settings, without assuming access to both datasets during model training. As a result, owners of domain data can effectively preserve their data privacy. Specifically, DDIBs build on denoising diffusion implicit models (DDIMs, Song et al. (2020a)). DDIMs introduce a particular parameterization of the diffusion process that creates a smooth, deterministic and reversible mapping between images and their latent representations. This mapping is given by the solution to a so-called probability flow (PF) ordinary differential equation (ODE), which forms the cornerstone of DDIBs.
Translation with DDIBs on a source-target pair requires two different PF ODEs: the source PF ODE converts input images to the latent space, while the target ODE then synthesizes images in the target domain. Crucially, the trained diffusion models are specific to the individual domains and rely on no domain pairing information. Effectively, DDIBs make it possible to save a trained model of a certain domain for future use, whenever that domain arises as the source or target in a new pair. Pairwise translation with DDIBs requires only a linear number of diffusion models (which can be further reduced with conditional models (Dhariwal & Nichol, 2021)), and training does not require scanning both datasets concurrently. Theoretically, we analyze the DDIBs translation process and highlight two important properties. First, the probability flow ODEs in DDIBs in essence comprise the solution of a special Schrödinger Bridge Problem (SBP) with linear or degenerate drift (Chen et al., 2021a), between the data and the latent distributions. This optimal transport justification, which alternative translation methods lack, is a theoretical advantage of our method: DDIBs constitute an OT-efficient translation procedure, while alternative methods may not. Second, DDIBs guarantee cycle consistency: translating an image to the target domain and back recovers the original image exactly, up to discretization errors introduced by the ODE solvers. Experimentally, we first present synthetic experiments on two-dimensional datasets to demonstrate DDIBs' cycle-consistency property. We then evaluate our method on a variety of image modalities, with qualitative and quantitative results: we validate its usage in example-guided color transfer, paired image translation, and conditional ImageNet translation. These results establish DDIBs as a scalable, theoretically rigorous addition to the family of unpaired image translation methods.
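As a minimal numerical illustration of this two-step process (a sketch, not the paper's implementation), the snippet below translates between two one-dimensional Gaussian "domains", in the spirit of the top panel of Figure 1. Because the score of a Gaussian marginal under a variance-preserving (VP) diffusion is available in closed form, we can integrate the probability flow ODE with plain Euler steps: forward with the source score to reach the latent space, then in reverse with the target score. The linear beta schedule, the step count, and all function names here are illustrative assumptions.

```python
import numpy as np

def beta(t):                      # assumed linear VP noise schedule
    return 0.1 + 19.9 * t         # beta(0) = 0.1, beta(1) = 20.0

def int_beta(t):                  # closed form of the integral of beta from 0 to t
    return 0.1 * t + 9.95 * t**2

def gaussian_score(x, t, mu, sigma):
    # Exact score of the time-t marginal when the data distribution is
    # N(mu, sigma^2): the VP marginal is N(a*mu, a^2*sigma^2 + 1 - a^2).
    a = np.exp(-0.5 * int_beta(t))
    var = a**2 * sigma**2 + 1.0 - a**2
    return -(x - a * mu) / var

def pf_ode_rhs(x, t, mu, sigma):
    # Probability flow ODE for a VP diffusion:
    # dx/dt = -0.5 * beta(t) * (x + score(x, t))
    return -0.5 * beta(t) * (x + gaussian_score(x, t, mu, sigma))

def integrate(x, mu, sigma, forward=True, n_steps=1000):
    # Euler integration of the PF ODE; forward maps data -> latent,
    # reverse maps latent -> data.
    ts = np.linspace(1e-4, 1.0, n_steps + 1)
    if not forward:
        ts = ts[::-1]
    for t0, t1 in zip(ts[:-1], ts[1:]):
        x = x + (t1 - t0) * pf_ode_rhs(x, t0, mu, sigma)
    return x

def ddib_translate(x_src, src, tgt):
    # DDIB: run the source PF ODE forward to the latent, then the
    # target PF ODE in reverse to the target domain.
    z = integrate(x_src, *src, forward=True)
    return integrate(z, *tgt, forward=False)

src, tgt = (2.0, 0.5), (-3.0, 1.5)     # (mean, std) of two 1-D "domains"
x0 = 2.5                               # source sample, one std above the source mean
xt = ddib_translate(x0, src, tgt)      # lands near one std above the target mean
x_back = ddib_translate(xt, tgt, src)  # cycle consistency: returns close to x0
```

Because both legs are deterministic ODE solves, running the translation back through the same two models recovers the starting point up to Euler discretization error, which is the cycle-consistency property discussed above in miniature.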

2.1. SCORE-BASED GENERATIVE MODELS (SGMs)

While our actual implementation utilizes DDIMs, we first briefly introduce the broader family of models known as score-based generative models. Two representative models of this family are score matching with Langevin dynamics (SMLD) (Song & Ermon, 2019) and denoising diffusion



Figure 1: Dual Diffusion Implicit Bridges. DDIBs leverage two ODEs for image translation. Given a source image x(s), the source ODE runs in the forward direction to convert it to the latent x(l), while the target ODE runs in the reverse direction to construct the target image x(t). (Top) Illustration of the DDIBs idea between two one-dimensional distributions. (Bottom) DDIBs from a tiger to a cat using a pretrained conditional diffusion model.

