DIFFUSION-BASED IMAGE TRANSLATION USING DISENTANGLED STYLE AND CONTENT REPRESENTATION

Abstract

Figure 1: Image translation results by DiffuseIT. Our model can generate high-quality translation outputs using both text and image conditions. More results can be found in the experiment section.

1. INTRODUCTION

Image translation is a task in which a model receives an input image and converts it into a target domain. Early image translation approaches (Zhu et al., 2017; Park et al., 2020; Isola et al., 2017) were mainly designed for single-domain translation, but were soon extended to multi-domain translation (Choi et al., 2018; Lee et al., 2019). As these methods demand a large training set for each domain, image translation approaches using only a single image pair have been studied, including one-to-one image translation using multiscale training (Lin et al., 2020) and patch matching strategies (Granot et al., 2022; Kolkin et al., 2019). Most recently, Splicing ViT (Tumanyan et al., 2022) exploits a pre-trained DINO ViT (Caron et al., 2021) to convert the semantic appearance of a given image into a target domain while maintaining the structure of the input image. On the other hand, by employing recent text-image embedding models such as CLIP (Radford et al., 2021), several approaches have attempted to generate images conditioned on text prompts (Patashnik et al., 2021; Gal et al., 2021; Crowson et al., 2022; Couairon et al., 2022). As these methods rely on Generative Adversarial Networks (GANs) as the backbone generative model, the semantic changes are often not properly controlled when applied to out-of-distribution (OOD) image generation.

