COLORISTANET FOR PHOTOREALISTIC VIDEO STYLE TRANSFER

Abstract

Photorealistic style transfer aims to transfer the artistic style of an image onto an input image or video while keeping photorealism. In this paper, we argue that it is the summary statistics matching scheme in existing algorithms that leads to unrealistic stylization. To avoid employing the popular Gram loss, we propose a self-supervised style transfer framework consisting of a style removal part and a style restoration part. The style removal network removes the original image styles, and the style restoration network recovers image styles in a supervised manner. Meanwhile, to address the problems in current feature transformation methods, we propose decoupled instance normalization, which decomposes feature transformation into style whitening and restylization. It works well in ColoristaNet and can transfer image styles efficiently while keeping photorealism. To ensure temporal coherency, we also incorporate optical flow methods and ConvLSTM to embed contextual information. Experiments demonstrate that ColoristaNet achieves better stylization effects than state-of-the-art algorithms.
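
For intuition, a minimal sketch of the whitening-plus-restylization decomposition described above might look like the following; the function name, the (N, C, H, W) tensor layout, and the choice of per-channel instance statistics are illustrative assumptions of ours, not necessarily the paper's exact formulation.

```python
import torch

def decoupled_in(content_feat, style_mean, style_std, eps=1e-5):
    """Hypothetical sketch of decoupled instance normalization.

    'Whitening' here is per-channel instance normalization of the
    (N, C, H, W) content features; 'restylization' re-applies style
    statistics (e.g. produced by a style encoder). The paper's exact
    formulation may differ.
    """
    mean = content_feat.mean(dim=(2, 3), keepdim=True)
    std = content_feat.std(dim=(2, 3), keepdim=True) + eps
    whitened = (content_feat - mean) / std       # style whitening
    return whitened * style_std + style_mean     # restylization
```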

1. INTRODUCTION

The rapid development of video-capture devices has made video a mainstream information carrier (Hansen, 2004). People often post videos with different color styles on social media (Kopf et al., 2012; Xu et al., 2014) to share daily life, express different emotions, and gain more exposure (Yan et al., 2016; Zabaleta & Bertalmío, 2021). Thus, photorealistic video style transfer, or automatic color stylization, has become popular on many mobile devices. Different from artistic style transfer (Gatys et al., 2016; Huang & Belongie, 2017), photorealistic video style transfer needs to replace the color styles of an original video with those of one or more reference images while keeping the outputs photorealistic. Photorealism in style transfer means that stylization results should look like real photos taken by cameras, without spatial distortions or unrealistic artifacts. Moreover, such algorithms need to run in real time. Although existing algorithms can conduct style transfer in many scenarios, their stylization results still contain unpleasant artifacts or look unreal, and some algorithms need additional support. In Figure 1(a), given a content image containing a tree in autumn and a style reference, the previous state-of-the-art algorithm WCT2 (Yoo et al., 2019) generates synthesized images with obvious structural artifacts. Besides, these algorithms conduct style transfer by completely matching the summary statistics of content features to those of the style reference, which leads to unrealistic stylization, as in Figure 1(b). For photorealistic style transfer in videos, only a few existing algorithms are available, and they can only perform style transfer under constraints: MVStylizer (Li et al., 2020) needs a good stylization initialization at the first frame, and Xia's method (Xia et al., 2021) requires additional semantic masks for each frame of a video. These problems limit the use of these methods in many real applications.
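
To make the statistics-matching scheme criticized above concrete, the sketch below implements AdaIN-style mean/variance matching (Huang & Belongie, 2017): content features are normalized and then rescaled to the style features' per-channel statistics. The function name and the (N, C, H, W) tensor layout are illustrative assumptions, not part of ColoristaNet.

```python
import torch

def adain(content_feat: torch.Tensor, style_feat: torch.Tensor,
          eps: float = 1e-5) -> torch.Tensor:
    """Match per-channel mean/std of content features to style features.

    Both inputs are (N, C, H, W) feature maps from a CNN encoder.
    Completely matching summary statistics in this way is what the
    paper argues can produce unrealistic stylization when the content
    and style images differ strongly.
    """
    c_mean = content_feat.mean(dim=(2, 3), keepdim=True)
    c_std = content_feat.std(dim=(2, 3), keepdim=True) + eps
    s_mean = style_feat.mean(dim=(2, 3), keepdim=True)
    s_std = style_feat.std(dim=(2, 3), keepdim=True)
    normalized = (content_feat - c_mean) / c_std
    return normalized * s_std + s_mean
```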

Several popular algorithms have been proposed for photorealistic style transfer on single images. DeepPhoto (Luan et al., 2017) incorporated semantic segmentation masks to guide style transfer and utilized a photorealism regularization term to reduce spatial distortions. PhotoWCT (Li et al., 2018) exploited whitening and coloring transforms (WCT, Li et al., 2017c) to conduct arbitrary style transfer and used photorealistic smoothing to remove spatially inconsistent stylization. WCT2 (Yoo et al., 2019) proposed a wavelet-corrected transfer based on WCT that preserves structural information while stylizing images. PhotoNAS (An et al., 2020) proposed a neural architecture search framework for photorealistic style transfer and achieved impressive results.
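
As background for PhotoWCT's feature transformation, the following is a minimal sketch of the whitening and coloring transform of Li et al. (2017c), computed via eigendecomposition of the feature covariance matrices. The function name and the (C, H*W) feature layout are illustrative choices, not taken from any particular implementation.

```python
import torch

def wct(content_feat: torch.Tensor, style_feat: torch.Tensor,
        eps: float = 1e-5) -> torch.Tensor:
    """Whitening and coloring transform on (C, H*W) feature matrices.

    Whitening removes the correlations of the content features;
    coloring re-imposes the covariance (the style statistics) of the
    style features, following Li et al. (2017c).
    """
    def center(f):
        mean = f.mean(dim=1, keepdim=True)
        return f - mean, mean

    fc, _ = center(content_feat)
    fs, s_mean = center(style_feat)

    # Whitening: E diag(D^-1/2) E^T f_c, from the content covariance.
    cov_c = fc @ fc.t() / (fc.size(1) - 1) + eps * torch.eye(fc.size(0))
    d_c, e_c = torch.linalg.eigh(cov_c)
    whitened = e_c @ torch.diag(d_c.clamp_min(eps).rsqrt()) @ e_c.t() @ fc

    # Coloring: E diag(D^1/2) E^T f_whitened, from the style covariance.
    cov_s = fs @ fs.t() / (fs.size(1) - 1) + eps * torch.eye(fs.size(0))
    d_s, e_s = torch.linalg.eigh(cov_s)
    colored = e_s @ torch.diag(d_s.clamp_min(0).sqrt()) @ e_s.t() @ whitened

    # Shift the colored features to the style mean.
    return colored + s_mean
```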

