TEXT-GUIDED DIFFUSION IMAGE STYLE TRANSFER WITH CONTRASTIVE LOSS

Abstract

Recently, diffusion models have demonstrated superior performance in text-guided image style transfer. However, due to the stochastic nature of diffusion models, there exists a fundamental trade-off between transforming styles and maintaining content. Although a simple remedy would be to use a deterministic sampling scheme such as the denoising diffusion implicit model (DDIM), which guarantees perfect reconstruction, this requires computationally expensive fine-tuning of the diffusion model. To address this, here we present a text-guided sampling scheme using a patch-wise contrastive loss. By exploiting the contrastive loss between the samples and the original images, our diffusion model can generate an image with the same semantic content as the source image. Experimental results demonstrate that our approach outperforms existing methods in maintaining content, while requiring no additional training of the diffusion model.

1. INTRODUCTION

Style transfer is the task of converting the style of a given image into another style while preserving its content. Over the past few years, GAN-based methods such as pix2pix (Isola et al., 2017), CycleGAN (Zhu et al., 2017), and contrastive unpaired image-to-image translation (CUT) (Park et al., 2020) have been developed. Recently, the joint use of a pretrained image generator and an image-text encoder has enabled text-guided image editing that requires little or no training of the networks (Radford et al., 2021; Crowson et al., 2022; Patashnik et al., 2021; Gal et al., 2022; Kwon & Ye, 2022). Inspired by the success of diffusion models for image generation (Ho et al., 2020; Song et al., 2020), image editing (Liu et al., 2021), in-painting (Avrahami et al., 2022), super-resolution (Chung et al., 2022), etc., many researchers have recently investigated the application of diffusion models to image-to-image style transfer (Saharia et al., 2022; Su et al., 2022). For example, (Saharia et al., 2022; 2021) proposed conditional diffusion models that require a paired dataset for image-to-image style transfer. One limitation of these approaches is that the diffusion models need to be trained with a paired dataset with matched source and target styles. As collecting matched source and target domain data is impractical, many recent researchers have focused on unconditional diffusion models. For example, the dual diffusion implicit bridge (DDIB) (Su et al., 2022) exploits two score functions that have been independently trained on two different domains. Although DDIB can translate one image into another without any external condition, it requires training two diffusion models, one per domain, which involves additional training time and a large amount of data.

Figure 1: Results of our style transfer method on various artistic styles. The source images are translated into various styles while maintaining their structure.
On the other hand, DiffusionCLIP (Kim et al., 2022) leverages a pretrained diffusion model and the CLIP encoder to enable text-driven image style transfer without an additional large training dataset. Unfortunately, DiffusionCLIP still requires additional fine-tuning of the model for each desired style. Besides this additional complexity, unconditional diffusion models for image style transfer have a further limitation in maintaining content. This is because the reverse sampling procedure of diffusion models has no explicit constraint to impose content consistency, and the stochastic nature of diffusion models makes it easy for them to change content and style at the same time. To address this, here we propose a diffusion model that transfers the style of a given image while preserving its semantic content by using a contrastive loss similar to CUT (Park et al., 2020). Since the contrastive loss can exploit spatial information in terms of positive and negative pairs, we find that the diffusion model already contains spatial information that can be used to maintain the content. Furthermore, in contrast to DiffusionCLIP, our method only requires fine-tuning lightweight multi-layer perceptron (MLP) layers via the CUT loss rather than the diffusion model itself, so the computational complexity can be significantly reduced. Moreover, thanks to the spatial features extracted from the diffusion model, we observe that the MLP fine-tuning is not even necessary, with only a slight decrease in quality. To verify the effectiveness of this method, we present a text-driven style transfer using CLIP (Radford et al., 2021). In particular, we utilize CLIP in a patch-wise manner similar to (Kwon & Ye, 2022), thanks to its stable style translation.
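To make the content-preservation mechanism concrete, the following is a minimal NumPy sketch of the patch-wise contrastive (InfoNCE-style) loss used by CUT, under the assumption that patch features have already been extracted from matched spatial locations of the source and stylized images; the function name and temperature value are illustrative, not the paper's exact implementation.

```python
import numpy as np

def patch_nce_loss(feat_src, feat_out, tau=0.07):
    """Patch-wise contrastive loss in the spirit of CUT (Park et al., 2020).

    feat_src, feat_out: (N, D) arrays of N patch embeddings taken at the
    same spatial locations in the source and the stylized output. Output
    patch i should match source patch i (positive pair) and be pushed
    away from all other source patches (negatives)."""
    # L2-normalize so the dot product is cosine similarity
    src = feat_src / np.linalg.norm(feat_src, axis=1, keepdims=True)
    out = feat_out / np.linalg.norm(feat_out, axis=1, keepdims=True)
    logits = out @ src.T / tau                           # (N, N) similarities
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # cross-entropy with the diagonal (matched patches) as the correct class
    return -np.mean(np.diag(log_prob))
```

Minimizing this loss pulls each output patch toward the source patch at the same location while repelling it from patches elsewhere in the image, which is how spatial structure is preserved without any pixel-level reconstruction constraint.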
Our contributions can be summarized as follows:

• Thanks to the content disentanglement using contrastive loss, to the best of our knowledge, our method is the first style transfer method with an unconditional diffusion model that overcomes the trade-off between style and content.

• Our method only requires a contrastive loss computed on features from the pre-trained diffusion model, rather than fine-tuning the diffusion model for the target domain, so the computational complexity is much lower while still allowing effective image transfer to any unseen domain.

2. RELATED WORKS

Image style transfer Neural style transfer (Gatys et al., 2016) is the first approach to change the style texture of a content image into that of a style image through an iterative optimization process. However, this iterative process takes a significant amount of time. Alternatively, adaptive instance normalization (AdaIN) (Huang & Belongie, 2017) converts the means and variances of the features of the source image to those of the target image, which enables arbitrary style transfer. On the other hand, pix2pix (Isola et al., 2017), CycleGAN (Zhu et al., 2017) and CUT (Park et al., 2020) rely on different mechanisms for content preservation. Specifically, in CycleGAN (Zhu et al., 2017), the cycle consistency assumes a bijective relationship between two domains for content preservation, a constraint that is often restrictive in some applications. In order to overcome this restriction, CUT (Park et al., 2020) was proposed to maximize the mutual information between the content input and stylized output images in a patch-based manner on the feature space. This preserves the structure between the two images while changing the appearance. With the advent of the CLIP model (Radford et al., 2021), it has been shown that text-guided image synthesis can be accomplished without collecting style images. CLIP has semantic representative power resulting from a large-scale dataset consisting of 400 million image-text pairs, which enables text-driven image manipulation. StyleCLIP (Patashnik et al., 2021) was proposed to optimize the latent vector of the content input given a text prompt by using CLIP and a pretrained StyleGAN (Karras et al., 2020). However, image modification using StyleCLIP is limited to the domain of the pretrained generator. In order to solve this issue, StyleGAN-NADA (Gal et al., 2022) presented an out-of-domain image manipulation method that shifts the generative model to new domains.
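The AdaIN operation mentioned above can be sketched in a few lines: it re-normalizes each channel of the content feature map to carry the per-channel statistics of the style feature map. This is a minimal NumPy sketch assuming (C, H, W) feature maps already extracted by an encoder; the function name and epsilon are illustrative.

```python
import numpy as np

def adain(content_feat, style_feat, eps=1e-5):
    """Adaptive instance normalization (Huang & Belongie, 2017).

    Normalizes each channel of the content feature map to zero mean and
    unit variance, then rescales it with the per-channel mean and std of
    the style feature map. Inputs are (C, H, W) arrays."""
    c_mean = content_feat.mean(axis=(1, 2), keepdims=True)
    c_std = content_feat.std(axis=(1, 2), keepdims=True)
    s_mean = style_feat.mean(axis=(1, 2), keepdims=True)
    s_std = style_feat.std(axis=(1, 2), keepdims=True)
    normalized = (content_feat - c_mean) / (c_std + eps)
    return normalized * s_std + s_mean
```

Because only first- and second-order channel statistics are transferred, the spatial arrangement of the content features (and hence the image structure) is untouched, which is why AdaIN enables arbitrary style transfer in a single forward pass.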
VQGAN-CLIP (Crowson et al., 2022) has shown that VQGAN (Esser et al., 2021) can also be used as a pretrained generative model to generate or edit high-quality images without training. In order not to be restricted to the domains of pretrained generators, CLIPstyler (Kwon & Ye, 2022) proposed a

