STYLE SPECTROSCOPE: IMPROVE INTERPRETABILITY AND CONTROLLABILITY THROUGH FOURIER ANALYSIS

Abstract

Universal style transfer (UST) infuses styles from arbitrary reference images into content images. Existing methods, despite many practical successes, are unable to explain several experimental observations, including the different performance of UST algorithms in preserving the spatial structure of content images. In addition, these methods are limited to cumbersome global controls on stylization and thus require additional spatial masks to achieve desired stylization. In this work, we provide a systematic Fourier analysis of a general framework for UST. We present an equivalent form of the framework in the frequency domain, which implies that existing algorithms treat all frequency components and pixels of feature maps equally, except for the zero-frequency component. We connect Fourier amplitude and phase with Gram matrices and a content reconstruction loss in style transfer, respectively. Based on this equivalence and these connections, we interpret the different structure preservation behaviors of UST algorithms in terms of Fourier phase. Building on these interpretations, we propose two practical manipulations for structure preservation and desired stylization. Both qualitative and quantitative experiments demonstrate the competitive performance of our method against state-of-the-art methods. We also conduct experiments to demonstrate (1) the aforementioned equivalence, (2) the interpretability based on Fourier amplitude and phase, and (3) the controllability associated with frequency components.

1. INTRODUCTION

Style transfer deals with the problem of synthesizing an image that combines the style characteristics of a style image with the content representation of a content image. The seminal work of Gatys et al. (2016) uses Gram matrices of feature maps to model style characteristics and iteratively optimizes reconstruction losses between the reference images and the stylized image. To obtain vivid visual styles at a lower computation cost, trained feed-forward networks have been proposed (Wang et al., 2020; Li & Wand, 2016; Johnson et al., 2016; Sheng et al., 2018; Li et al., 2017b; Dumoulin et al., 2017). Recent works focus on arbitrary style transfer (Park & Lee, 2019; Chen et al., 2021a; b; Chandran et al., 2021) or artistic style (Chen et al., 2021b; Liu et al., 2021; Chen et al., 2021c). These works capture only limited types of style and cannot generalize well to unseen style images (Hong et al., 2021).

To generalize to arbitrary style images, many methods have been proposed for the task of universal style transfer (UST). Essentially, the main challenge of UST is to properly extract the style characteristics from style images and transfer them onto content images without any prior knowledge of the target style. Representative UST methods consider various notions of style characteristics. For example, AdaIN (Huang & Belongie, 2017) aligns the channel-wise means and variances of feature maps between content images and style images, while WCT (Li et al., 2017a) further matches the covariance matrices of feature maps by means of whitening and coloring processes, leading to more expressive colors and more intensive stylization. While these two approaches and their derivative works show impressive stylization performance, they behave differently in preserving the structure of content images. It is observed that the operations performed by AdaIN can do better in structure preservation of content images while those
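To make the contrast between the two style representations concrete, the following NumPy sketch illustrates AdaIN's channel-wise moment matching and the Gram matrix used by Gatys et al. (2016). This is a minimal illustration under assumed (C, H, W) feature shapes, not the original implementations; the function names and the epsilon are our own.

```python
import numpy as np

def adain(content_feat, style_feat, eps=1e-5):
    """AdaIN-style moment matching: normalize each channel of the
    content feature map, then rescale and shift it so its per-channel
    mean and standard deviation match those of the style feature map.
    Both inputs have shape (C, H, W)."""
    c_mean = content_feat.mean(axis=(1, 2), keepdims=True)
    c_std = content_feat.std(axis=(1, 2), keepdims=True) + eps
    s_mean = style_feat.mean(axis=(1, 2), keepdims=True)
    s_std = style_feat.std(axis=(1, 2), keepdims=True)
    return s_std * (content_feat - c_mean) / c_std + s_mean

def gram_matrix(feat):
    """Gram-matrix style representation: channel-wise inner products
    of the flattened feature map, normalized by spatial size.
    Input shape (C, H, W); output shape (C, C)."""
    c, h, w = feat.shape
    f = feat.reshape(c, h * w)
    return f @ f.T / (h * w)
```

Note that `adain` touches only first- and second-order per-channel statistics, whereas the Gram matrix additionally encodes cross-channel correlations, which is the gap WCT closes by matching full covariance matrices.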

