TEXT-GUIDED DIFFUSION IMAGE STYLE TRANSFER WITH CONTRASTIVE LOSS

Abstract

Recently, diffusion models have demonstrated superior performance in text-guided image style transfer. However, due to the stochastic nature of diffusion models, there exists a fundamental trade-off between transforming style and preserving content. Although a simple remedy would be to use a deterministic sampling scheme such as the denoising diffusion implicit model (DDIM), which guarantees perfect reconstruction, this requires computationally expensive fine-tuning of the diffusion model. To address this, we present a text-guided sampling scheme that uses a patch-wise contrastive loss. By exploiting the contrastive loss between the samples and the original images, our diffusion model can generate an image with the same semantic content as the source image. Experimental results demonstrate that our approach outperforms existing methods in maintaining content while requiring no additional training of the diffusion model.
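The patch-wise contrastive loss referred to above follows the PatchNCE idea of CUT (Park et al., 2020): a patch feature from the generated image should be similar to the feature at the same spatial location in the source image (the positive) and dissimilar to features from other locations (the negatives). A minimal NumPy sketch, assuming patch features have already been extracted; the function and variable names are illustrative and not taken from the authors' implementation:

```python
import numpy as np

def patch_nce_loss(feat_gen, feat_src, tau=0.07):
    """Patch-wise contrastive (InfoNCE-style) loss sketch.

    feat_gen, feat_src: (num_patches, dim) arrays of features taken from
    the SAME spatial locations of the generated and source images.
    tau: temperature controlling the sharpness of the softmax.
    """
    # L2-normalize features so the dot product is a cosine similarity.
    g = feat_gen / np.linalg.norm(feat_gen, axis=1, keepdims=True)
    s = feat_src / np.linalg.norm(feat_src, axis=1, keepdims=True)

    # Similarity of every generated patch to every source patch: (N, N).
    logits = g @ s.T / tau

    # Row i's positive is source patch i; all other patches are negatives.
    # Numerically stable log-softmax followed by cross-entropy on the diagonal.
    logits = logits - logits.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))
```

Minimizing this loss during sampling encourages each generated patch to stay semantically aligned with its source location, which is what allows content to be preserved without fine-tuning the diffusion model.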

1. INTRODUCTION

Style transfer is the task of converting the style of a given image into another style while preserving its content. Over the past few years, GAN-based methods such as pix2pix (Isola et al., 2017), CycleGAN (Zhu et al., 2017), and contrastive unpaired image-to-image translation (CUT) (Park et al., 2020) have been developed. Recently, the joint use of a pretrained image generator and an image-text encoder has enabled text-guided image editing that requires little or no training of the networks (Radford et al., 2021; Crowson et al., 2022; Patashnik et al., 2021; Gal et al., 2022; Kwon & Ye, 2022). Inspired by the success of diffusion models in image generation (Ho et al., 2020; Song et al., 2020), image editing (Liu et al., 2021), in-painting (Avrahami et al., 2022), super-resolution (Chung et al., 2022), etc., many researchers have recently investigated the application of diffusion models to image-to-image style transfer (Saharia et al., 2022; Su et al., 2022). For example, Saharia et al. (2022; 2021) proposed conditional diffusion models that require paired datasets for image-to-image style transfer. One limitation of these approaches is that the diffusion models need to be trained with paired data of matched source and target styles. As collecting matched source and target domain data is impractical, many recent researchers have focused on unconditional diffusion models.

Figure 1: Results of our style transfer method on various artistic styles. The source images are translated into various styles while maintaining their structure.

