DIFFUSION-BASED IMAGE TRANSLATION USING DISENTANGLED STYLE AND CONTENT REPRESENTATION

Abstract

Figure 1: Image translation results by DiffuseIT. Our model can generate high-quality translation outputs using both text and image conditions. More results can be found in the experiment section.

1. INTRODUCTION

Image translation is a task in which the model receives an input image and converts it into a target domain. Early image translation approaches (Zhu et al., 2017; Park et al., 2020; Isola et al., 2017) were mainly designed for single-domain translation, but were soon extended to multi-domain translation (Choi et al., 2018; Lee et al., 2019). As these methods demand a large training set for each domain, image translation approaches using only a single image pair have been studied, which include one-to-one image translation using multiscale training (Lin et al., 2020) or a patch matching strategy (Granot et al., 2022; Kolkin et al., 2019). Most recently, Splicing ViT (Tumanyan et al., 2022) exploits a pre-trained DINO ViT (Caron et al., 2021) to convert the semantic appearance of a given image into a target domain while maintaining the structure of the input image.

On the other hand, by employing recent text-to-image embedding models such as CLIP (Radford et al., 2021), several approaches have attempted to generate images conditioned on text prompts (Patashnik et al., 2021; Gal et al., 2021; Crowson et al., 2022; Couairon et al., 2022). As these methods rely on Generative Adversarial Networks (GANs) as a backbone generative model, the semantic changes are often not properly controlled when applied to out-of-distribution (OOD) image generation. Recently, score-based generative models (Ho et al., 2020; Song et al., 2020b; Nichol & Dhariwal, 2021) have demonstrated state-of-the-art performance in text-conditioned image generation (Ramesh et al., 2022; Saharia et al., 2022b; Crowson, 2022; Avrahami et al., 2022). However, when it comes to the image translation scenario in which multiple conditions (e.g., input image and text condition) are given to the score-based model, disentangling and separately controlling these components still remains an open problem.

In fact, one of the most important open questions in image translation with diffusion models is how to transform only the semantic information (or style) while maintaining the structure information (or content) of the input image. Although this may not be an issue for conditional diffusion models trained with matched input and target domain images (Saharia et al., 2022a), such training is impractical in many image translation tasks (e.g., summer-to-winter or horse-to-zebra translation). On the other hand, existing methods using unconditional diffusion models often fail to preserve content information due to an entanglement problem in which semantics and content change at the same time (Avrahami et al., 2022; Crowson, 2022). DiffusionCLIP (Kim et al., 2022) tried to address this problem using denoising diffusion implicit models (DDIM) (Song et al., 2020a) and a pixel-wise loss, but the score function needs to be fine-tuned for each novel target domain, which is computationally expensive.

In order to control the diffusion process in such a way that it produces outputs that simultaneously retain the content of the input image and follow the semantics of the target text or image, here we introduce a loss function using a pre-trained Vision Transformer (ViT) (Dosovitskiy et al., 2020). Specifically, inspired by the recent idea of Tumanyan et al. (2022), we extract intermediate keys of the multi-head self-attention layer and the [CLS] classification token of the last layer from the DINO ViT model and use them as our content and style regularization, respectively.
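To make the two descriptors concrete, the following is a minimal sketch (not the authors' exact code) of extracting spatial self-attention keys and the last-layer [CLS] token from the publicly released DINO ViT-S/16; the choice of layer index 11 and the use of a forward hook on the `qkv` projection are illustrative assumptions.

```python
# Sketch: extract ViT keys (content descriptor) and [CLS] token (style
# descriptor) from DINO ViT-S/16. Layer index and hook placement are
# assumptions for illustration, not the paper's exact configuration.
import torch

model = torch.hub.load('facebookresearch/dino:main', 'dino_vits16')
model.eval()

features = {}

def qkv_hook(module, inputs, output):
    # output: (B, N, 3 * dim); split into per-head q, k, v
    B, N, _ = output.shape
    num_heads = 6                       # ViT-S/16 uses 6 attention heads
    qkv = output.reshape(B, N, 3, num_heads, -1).permute(2, 0, 3, 1, 4)
    features['keys'] = qkv[1]           # keys: (B, heads, N, head_dim)

handle = model.blocks[11].attn.qkv.register_forward_hook(qkv_hook)

x = torch.randn(1, 3, 224, 224)         # stand-in for a preprocessed image
with torch.no_grad():
    cls_token = model(x)                # DINO forward returns the [CLS] embedding
keys = features['keys']                 # spatial keys used for content regularization
handle.remove()
```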
More specifically, to preserve the structural information, we use similarity and contrastive losses between the intermediate keys of the input and the denoised image during sampling. Then, image-guided style transfer is performed by matching the [CLS] token between the denoised sample and the target domain, whereas an additional CLIP loss is used for text-driven style transfer. To further improve sampling speed, we propose a novel semantic divergence loss and a resampling strategy. Extensive experimental results, including Fig. 1, confirm that our method provides state-of-the-art performance in both text- and image-guided style transfer tasks, quantitatively and qualitatively. To the best of our knowledge, this is the first unconditional diffusion model-based image translation method that allows both text- and image-guided style transfer without altering the content of the input image.
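The sketch below shows one plausible way these guidance terms could be combined at each sampling step; the helper signatures, loss weights, and the simple cosine form of the key-matching term are assumptions (the paper additionally uses a contrastive formulation over the keys), so this is a schematic rather than the authors' implementation.

```python
# Schematic combination of content and style guidance losses, under the
# assumptions stated above. `clip_loss` stands in for a precomputed
# CLIP-based text-guidance term in the text-driven branch.
import torch
import torch.nn.functional as F

def content_loss(keys_src: torch.Tensor, keys_out: torch.Tensor) -> torch.Tensor:
    # Keep the spatial key structure of the source image; shown here as
    # cosine-similarity matching (the paper also adds a contrastive term).
    sim = F.cosine_similarity(keys_src.flatten(1), keys_out.flatten(1), dim=-1)
    return (1 - sim).mean()

def style_loss_image(cls_out: torch.Tensor, cls_trg: torch.Tensor) -> torch.Tensor:
    # Image-guided style transfer: match last-layer [CLS] tokens.
    return F.mse_loss(cls_out, cls_trg)

def total_guidance(keys_src, keys_out, cls_out, cls_trg=None, clip_loss=None,
                   lambda_c=1.0, lambda_s=1.0):
    loss = lambda_c * content_loss(keys_src, keys_out)
    if cls_trg is not None:              # image-guided branch
        loss = loss + lambda_s * style_loss_image(cls_out, cls_trg)
    if clip_loss is not None:            # text-guided branch (CLIP loss)
        loss = loss + lambda_s * clip_loss
    return loss
```

In a guided sampler, the gradient of this total loss with respect to the current denoised estimate would steer each reverse diffusion step toward the target style while the key-matching term anchors the structure.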

2. RELATED WORK

Text-guided image synthesis. Thanks to the outstanding text-to-image alignment in its feature space, CLIP has been widely used in various text-related computer vision tasks, including object generation (Liu et al., 2021; Wang et al., 2022a), style transfer (Kwon & Ye, 2021; Fu et al., 2021), and object segmentation (Lüddecke & Ecker, 2022; Wang et al., 2022b). Several recent approaches have also demonstrated state-of-the-art performance in text-guided image manipulation by combining CLIP with image generation models. Earlier approaches leverage a pre-trained StyleGAN (Karras et al., 2020) for image manipulation with a text condition (Patashnik et al., 2021; Gal et al., 2021; Wei et al., 2022). However, StyleGAN-based methods cannot be used on arbitrary natural images since they are restricted to the pre-trained data domain. A pre-trained VQGAN (Esser et al., 2021) was proposed for better generalization capability in image manipulation, but it often suffers from poor image quality due to the limited power of the backbone model. With the advance of score-based generative models such as the Denoising Diffusion Probabilistic Model (DDPM) (Ho et al., 2020), several methods (Ramesh et al., 2022; Saharia et al., 2022b) have tried to generate photo-realistic image samples from given text conditions. However, these approaches are not adequate as an image translation framework since the text condition and the input image are usually not disentangled. Although DiffusionCLIP (Kim et al., 2022) partially solves the problem using DDIM sampling and pixelwise regularization during the reverse diffusion, it has a major disadvantage in that it requires a fine-tuning process of the score model. As concurrent work, DDIB (Su et al., 2022) proposed diffusion model-based image translation using a deterministic probability flow ODE formulation.
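For context, the deterministic DDIM update that DiffusionCLIP builds on takes the standard form below (notation follows Song et al. (2020a) rather than this paper, with $\bar{\alpha}_t$ the cumulative noise schedule and $\epsilon_\theta$ the learned noise predictor):

$$x_{t-1} = \sqrt{\bar{\alpha}_{t-1}} \left( \frac{x_t - \sqrt{1-\bar{\alpha}_t}\,\epsilon_\theta(x_t, t)}{\sqrt{\bar{\alpha}_t}} \right) + \sqrt{1-\bar{\alpha}_{t-1}}\,\epsilon_\theta(x_t, t).$$

Because this mapping contains no injected noise, running it forward and backward gives a near-invertible encoding of the input image, which is the property that both DiffusionCLIP and DDIB's probability flow ODE formulation exploit.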

