COLORISTANET FOR PHOTOREALISTIC VIDEO STYLE TRANSFER

Abstract

Photorealistic style transfer aims to transfer the artistic style of a reference image onto an input image or video while keeping photorealism. In this paper, we argue that it is the summary statistics matching scheme in existing algorithms that leads to unrealistic stylization. To avoid the popular Gram loss, we propose a self-supervised style transfer framework that consists of a style removal part and a style restoration part. The style removal network removes the original image styles, and the style restoration network recovers image styles in a supervised manner. Meanwhile, to address the problems in current feature transformation methods, we propose decoupled instance normalization, which decomposes feature transformation into style whitening and restylization. It works well in ColoristaNet and can transfer image styles efficiently while keeping photorealism. To ensure temporal coherence, we also incorporate optical flow methods and ConvLSTM to embed contextual information. Experiments demonstrate that ColoristaNet achieves better stylization effects than state-of-the-art algorithms.

1. INTRODUCTION

The rapid development of video-capture devices has made videos a mainstream information carrier (Hansen, 2004). People often post videos with different color styles on social media (Kopf et al., 2012; Xu et al., 2014) to share daily life, express emotions, and gain exposure (Yan et al., 2016; Zabaleta & Bertalmío, 2021). Thus, photorealistic video style transfer, or automatic color stylization, has become popular on many mobile devices. Different from artistic style transfer (Gatys et al., 2016; Huang & Belongie, 2017), photorealistic video style transfer needs to replace the color styles of original videos with those of one or multiple reference images while ensuring the outputs maintain "photorealism". Photorealism in style transfer means that stylization results should look like real photos taken by cameras, without spatial distortions or unrealistic artifacts. Moreover, algorithms need to run in real time.

Several popular algorithms have been proposed for photorealistic style transfer on single images. DeepPhoto (Luan et al., 2017) incorporated semantic segmentation masks to guide style transfer and utilized a photorealism regularization term to reduce spatial distortions. PhotoWCT (Li et al., 2018) exploited whitening and coloring transforms (WCT (Li et al., 2017c)) to conduct arbitrary style transfer and used photorealistic smoothing to remove spatially inconsistent stylization. WCT2 (Yoo et al., 2019) proposed a wavelet-corrected transfer based on WCT to preserve structural information while stylizing images. PhotoNAS (An et al., 2020) proposed a neural architecture search framework for photorealistic style transfer and achieved impressive results. Although these algorithms can conduct style transfer in many scenarios, their stylization results still contain unpleasant artifacts or look unreal, and some require additional inputs.
In Figure 1 (a), given a content image containing a tree in autumn and a style reference, the previous state-of-the-art algorithm WCT2 (Yoo et al., 2019) generates synthesized images with obvious structural artifacts. Besides, these algorithms conduct style transfer by completely matching the summary statistics of content features to those of style references, which leads to unrealistic stylization as in Figure 1 (b). For photorealistic style transfer in videos, there are only a few existing algorithms, and they can only perform style transfer under constraints: MVStylizer (Li et al., 2020) needs a good stylization initialization at the first frame, and Xia's method (Xia et al., 2021) requires additional semantic masks for each video frame. These problems limit the use of such methods in many real applications.

In this paper, we aim to solve the problems listed above in photorealistic video style transfer. Different from previous algorithms, which match summary statistics of content images to those of style references through whitening and coloring transforms (Li et al., 2018), adaptive instance normalization (An et al., 2020), or the Gram loss (Luan et al., 2017), we propose a style removal and restoration framework trained in a self-supervised manner to conduct arbitrary style transfer while keeping photorealism. Our motivation is that if we can remove the style of the image content without destroying image structures, we can recover its original style by using the content image both as the style reference and as the stylization target. In our experience, the artifacts produced by PhotoWCT (Li et al., 2018), WCT2 (Yoo et al., 2019), and PhotoNAS (An et al., 2020) stem from two sources: (1) the Gram loss; (2) whitening and coloring transforms (WCT (Li et al., 2017c)). In our method, we avoid the Gram loss and train networks with the content loss only (Gatys et al., 2016).
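To make the distinction concrete, the following is a minimal sketch of the two training objectives mentioned above: the content (perceptual) loss that our method keeps, and the Gram loss that it avoids. Feature maps are represented as plain NumPy arrays of shape (channels, height, width); the function names are illustrative, not from any released implementation.

```python
import numpy as np

def gram_matrix(feat):
    # feat: (C, H, W) feature map.
    # The Gram matrix captures channel-wise correlations, i.e. the
    # summary statistics matched by Gram-loss-based style transfer.
    c, h, w = feat.shape
    f = feat.reshape(c, h * w)
    return f @ f.T / (h * w)

def content_loss(feat_out, feat_content):
    # Mean squared error between feature maps (Gatys et al., 2016);
    # the only training loss the paper keeps.
    return np.mean((feat_out - feat_content) ** 2)

def gram_loss(feat_out, feat_style):
    # The summary-statistics matching objective that the paper
    # identifies as a source of unrealistic artifacts and avoids.
    return np.mean((gram_matrix(feat_out) - gram_matrix(feat_style)) ** 2)
```

Note that minimizing the Gram loss forces the output's channel correlations to match the style reference exactly, regardless of spatial structure, which is why complete statistics matching can look painterly.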
We improve on the summary statistics matching scheme with decoupled instance normalization, which removes original image styles and adds new styles without hurting image structures. Meanwhile, decoupled instance normalization does not match the styles of reference images completely, and thus avoids the unrealistic stylization shown in Figure 1 (b). To keep temporal consistency in videos, we exploit optical flow estimation (Teed & Deng, 2020) and ConvLSTM (Shi et al., 2015a) to conduct style transfer consecutively. We summarize our contributions as follows:

• We propose a novel photorealistic video style transfer network called ColoristaNet, which conducts color style transfer in videos without introducing painterly spatial distortions or inconsistent flickering artifacts. We include many videos in the supplementary material for comparison with other state-of-the-art algorithms.

• We propose decoupled instance normalization, which works together with ConvLSTM (Shi et al., 2015a) to implement structure-preserving and temporally consistent feature transformation. Decoupled instance normalization decomposes style transfer into feature whitening and stylization, which avoids unrealistic style transfer.

• ColoristaNet can adapt color styles in videos consecutively with multiple different style references and runs faster than most recent algorithms. Qualitative results and a user study show that our method outperforms other state-of-the-art algorithms in striking a balance between good stylization and photorealism. Besides, we conduct extensive ablation studies whose results clearly demonstrate the effectiveness of the different modules and designs in ColoristaNet.
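The whitening-then-restylization decomposition can be sketched with simple per-channel moment matching. This is only a simplified, AdaIN-style stand-in to illustrate the two decoupled steps, not the learned transform used in ColoristaNet itself; all function names are hypothetical.

```python
import numpy as np

def whiten(feat, eps=1e-5):
    # Style removal step: normalize each channel of a (C, H, W)
    # feature map to zero mean and unit variance, stripping the
    # input's style statistics while preserving spatial structure.
    mean = feat.mean(axis=(1, 2), keepdims=True)
    std = feat.std(axis=(1, 2), keepdims=True)
    return (feat - mean) / (std + eps)

def restylize(whitened, style_feat):
    # Restylization step: re-inject the per-channel statistics of
    # the style reference features into the whitened content features.
    mean = style_feat.mean(axis=(1, 2), keepdims=True)
    std = style_feat.std(axis=(1, 2), keepdims=True)
    return whitened * std + mean

def decoupled_transform(content_feat, style_feat):
    # The two steps composed: whitening decouples structure from
    # style; restylization adds the new style on top.
    return restylize(whiten(content_feat), style_feat)
```

Because the two steps are decoupled, the whitening half can be supervised on its own (removing a style without destroying structure), which is what makes the self-supervised removal-and-restoration training described above possible.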



Figure 1: Illustration of unsolved problems in photorealistic style transfer. From left to right: (a) Previous state-of-the-art algorithm WCT 2 (Yoo et al., 2019) generates stylization results with obvious structural artifacts. (b) The stylization result produced by WCT 2 (Yoo et al., 2019) looks painterly and slightly unreal. (c) Video stylization algorithms need additional inputs, such as good stylization initialization (Li et al., 2020) or semantic masks (Xia et al., 2021), to guide style transfer.

