STYLE SPECTROSCOPE: IMPROVE INTERPRETABILITY AND CONTROLLABILITY THROUGH FOURIER ANALYSIS

Abstract

Universal style transfer (UST) infuses styles from arbitrary reference images into content images. Existing methods, while enjoying many practical successes, are unable to explain experimental observations, including the different performances of UST algorithms in preserving the spatial structure of content images. In addition, these methods are limited to cumbersome global controls on stylization, so they require additional spatial masks to achieve a desired stylization. In this work, we provide a systematic Fourier analysis of a general framework for UST. We present an equivalent form of the framework in the frequency domain. The form implies that existing algorithms treat all frequency components and pixels of feature maps equally, except for the zero-frequency component. We connect Fourier amplitude and phase with Gram matrices and a content reconstruction loss in style transfer, respectively. Based on this equivalence and these connections, we interpret the different structure preservation behaviors of the algorithms in terms of Fourier phase. Given these interpretations, we propose two practical manipulations for structure preservation and desired stylization. Both qualitative and quantitative experiments demonstrate the competitive performance of our method against state-of-the-art methods. We also conduct experiments to demonstrate (1) the abovementioned equivalence, (2) the interpretability based on Fourier amplitude and phase, and (3) the controllability associated with frequency components.

1. INTRODUCTION

Style transfer deals with the problem of synthesizing an image that has the style characteristics of a style image and the content representation of a content image. The seminal work (Gatys et al., 2016) uses Gram matrices of feature maps to model style characteristics and iteratively optimizes reconstruction losses between the reference images and the stylized images. To obtain vivid visual styles at a lower computation cost, a number of trained feed-forward networks have been proposed (Wang et al., 2020; Li & Wand, 2016; Johnson et al., 2016; Sheng et al., 2018; Li et al., 2017b; Sty; Dumoulin et al., 2017). Recent works focus on arbitrary style transfer (Park & Lee, 2019; Chen et al., 2021a; b; Chandran et al., 2021) or artistic style (Chen et al., 2021b; Liu et al., 2021; Chen et al., 2021c). These works capture limited types of style and do not generalize well to unseen style images (Hong et al., 2021). To generalize to arbitrary style images, many methods have been proposed for the task of universal style transfer (UST). Essentially, the main challenge of UST is to properly extract the style characteristics from style images and transfer them onto content images without any prior knowledge of the target style. The representative methods of UST consider various notions of style characteristics. For example, AdaIN (Huang & Belongie, 2017) aligns the channel-wise means and variances of feature maps between content images and style images, and WCT (Li et al., 2017a) further matches the covariance matrices of feature maps by means of whitening and coloring processes, leading to more expressive colors and more intensive stylization. While these two approaches and their derivative works show impressive stylization performance, they behave differently in preserving the structure of content images.
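As a rough illustration of the two feature transforms just described, the following NumPy sketch aligns channel-wise means and standard deviations (AdaIN-style) and matches channel covariances through whitening and coloring (WCT-style). It is a minimal sketch under stated assumptions, not the original implementations: the function names `adain`, `wct`, and `_cov_power`, the `eps` regularizer, and the use of plain arrays of shape (C, H, W) in place of encoder features are all choices made here for illustration.

```python
import numpy as np


def adain(content, style, eps=1e-5):
    """Align channel-wise mean and std of content features (C, H, W) to style."""
    c_mean = content.mean(axis=(1, 2), keepdims=True)
    c_std = content.std(axis=(1, 2), keepdims=True) + eps
    s_mean = style.mean(axis=(1, 2), keepdims=True)
    s_std = style.std(axis=(1, 2), keepdims=True) + eps
    return (content - c_mean) / c_std * s_std + s_mean


def _cov_power(centered, power, eps=1e-5):
    """Return cov(centered)**power for centered features of shape (C, N)."""
    n_channels, n_pixels = centered.shape
    cov = centered @ centered.T / (n_pixels - 1) + eps * np.eye(n_channels)
    vals, vecs = np.linalg.eigh(cov)  # cov is symmetric positive definite
    return (vecs * vals**power) @ vecs.T


def wct(content, style, eps=1e-5):
    """Whiten content features, then color them with the style covariance."""
    n_channels, height, width = content.shape
    fc = content.reshape(n_channels, -1)
    fs = style.reshape(n_channels, -1)
    fc = fc - fc.mean(axis=1, keepdims=True)
    s_mean = fs.mean(axis=1, keepdims=True)
    fs = fs - s_mean
    whitened = _cov_power(fc, -0.5, eps) @ fc    # covariance close to identity
    colored = _cov_power(fs, 0.5, eps) @ whitened  # covariance close to style
    return (colored + s_mean).reshape(n_channels, height, width)
```

Note that the AdaIN-style transform acts on each channel independently, whereas the WCT-style transform mixes channels through full covariance matrices.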
It is observed that the operations performed by AdaIN better preserve the structure of content images, while those of WCT might introduce structural artifacts and distortions. Many follow-up works focus on alleviating this problem of WCT (Li et al., 2018; Chiu & Gurari, 2022; Yoo et al., 2019), but few explain analytically and systematically what makes the difference. In the field of UST, we need an analytical theory that bridges algorithms with experimental phenomena for better interpretability, potentially leading to better stylization controls. To this end, we resort to the Fourier transform for deeper analysis, aiming to find a new equivalence in the frequency domain and to bring new interpretations and practical manipulations to existing style transfer methods.

In this work, we first revisit a framework by Li et al. (2017a) which unifies several well-known UST methods. Based on the framework, we derive an equivalent form for it in the frequency domain, which is as simple as its original form in the spatial domain. The derived result demonstrates that these UST methods perform a uniform transformation in the frequency domain except at the origin. Furthermore, these UST methods transform frequency components (excluding the zero-frequency component) and spatial pixels of feature maps in an identical manner. Thus, these UST methods manipulate the whole frequency domain instead of specific subsets of frequencies (either high frequencies or low frequencies).

Secondly, through the lens of the Fourier transform, we further explore the relation of Fourier phase and amplitude to key notions in style transfer, and then present new interpretations based on the equivalence. On one hand, we prove that a content reconstruction loss between two feature maps reaches a local minimum when they have identical Fourier phase, which implies that the Fourier phase of feature maps contributes to the structure of stylized results.
On the other hand, we prove that the Fourier amplitude of feature maps determines the diagonals of their Gram matrices, which implies that Fourier amplitude contributes to the intensity information of stylized images. Next, we demonstrate that AdaIN preserves the Fourier phase of feature maps while WCT does not, and we interpret the different structure preservation behaviors of the UST methods as a consequence of their different treatment of the Fourier phase of feature maps.

Thirdly, based on the connection we establish between style transfer and the Fourier transform, we propose two manipulations on the frequency components of feature maps: (1) a phase replacement operation that keeps the phase of feature maps unchanged during stylization for better structure preservation, and (2) a feature combination operation that assigns different weights to different frequency components of feature maps for desired stylization. We then conduct extensive experiments to validate their efficacy.

The contributions of this paper are summarized as follows:
• Equivalence. We present a theoretically equivalent form for several state-of-the-art UST methods in the frequency domain and reveal their effects on frequencies. We conduct corresponding experiments to validate the equivalence.
• Interpretability. We connect Fourier amplitude and phase with key notions in style transfer and present new interpretations of the different behaviors of UST methods. The interpretations are validated by experiments.
• Controllability. We propose two manipulations for structure preservation and desired stylization. We validate their efficacy and controllability experimentally.
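The phase and amplitude claims above can be checked numerically on toy data. The sketch below is illustrative only and rests on several assumptions made here: random arrays stand in for encoder feature maps, a random channel-mixing matrix stands in for a covariance-matching transform such as WCT, and `phase_replace` is one hypothetical way to realize a phase replacement (stylized amplitude combined with content phase), not necessarily the exact operation proposed in this paper. All function and variable names are illustrative.

```python
import numpy as np


def adain(content, style, eps=1e-5):
    """Channel-wise mean/std alignment on (C, H, W) arrays."""
    c_mean = content.mean(axis=(1, 2), keepdims=True)
    c_std = content.std(axis=(1, 2), keepdims=True) + eps
    s_mean = style.mean(axis=(1, 2), keepdims=True)
    s_std = style.std(axis=(1, 2), keepdims=True) + eps
    return (content - c_mean) / c_std * s_std + s_mean


def unit_phasor(feat, eps=1e-12):
    """Per-channel 2D DFT normalized to unit modulus, i.e. exp(j * phase)."""
    spec = np.fft.fft2(feat, axes=(-2, -1))
    return spec / (np.abs(spec) + eps)


def gram_diagonal(feat):
    """Diagonal of the (unnormalized) Gram matrix: per-channel sum of squares."""
    flat = feat.reshape(feat.shape[0], -1)
    return np.einsum("cn,cn->c", flat, flat)


def phase_replace(stylized, content):
    """Hypothetical phase replacement: stylized amplitude with content phase."""
    s_spec = np.fft.fft2(stylized, axes=(-2, -1))
    c_spec = np.fft.fft2(content, axes=(-2, -1))
    mixed = np.abs(s_spec) * np.exp(1j * np.angle(c_spec))
    return np.fft.ifft2(mixed, axes=(-2, -1)).real


rng = np.random.default_rng(0)
height, width = 8, 8
content = rng.normal(size=(3, height, width))
style = 2.0 * rng.normal(size=(3, height, width)) + 1.0

# Mask selecting every frequency except the zero-frequency component.
nonzero = np.ones((height, width), dtype=bool)
nonzero[0, 0] = False

# (a) AdaIN is a positive per-channel scaling plus a constant shift, so the
#     phase at all nonzero frequencies should be unchanged.
phase_kept = np.allclose(
    unit_phasor(adain(content, style))[:, nonzero],
    unit_phasor(content)[:, nonzero],
    atol=1e-6,
)

# (b) Parseval: the Gram diagonal depends only on the Fourier amplitude.
amplitude = np.abs(np.fft.fft2(content, axes=(-2, -1)))
parseval_ok = np.allclose(
    gram_diagonal(content), (amplitude**2).sum(axis=(1, 2)) / (height * width)
)

# (c) A channel-mixing transform (a stand-in for WCT) generally alters phase.
mix = rng.normal(size=(3, 3))
stylized = np.einsum("dc,chw->dhw", mix, content)
phase_changed = not np.allclose(
    unit_phasor(stylized)[:, nonzero], unit_phasor(content)[:, nonzero], atol=1e-6
)

# (d) Phase replacement restores the content phase and keeps the amplitude.
restored = phase_replace(stylized, content)
```

Under these assumptions, `restored` keeps the Gram diagonal of `stylized` (same amplitude) while carrying the Fourier phase of `content`, which mirrors the division of roles between amplitude and phase described above.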

2. PRELIMINARIES

2.1 FOURIER TRANSFORM

The Fourier transform has been widely used for the analysis of the frequency components in signals, including images and feature maps in the shallow layers of neural networks. Given an image $F \in \mathbb{R}^{C \times H \times W}$, the discrete Fourier transform (DFT) (Jenkins & Desai, 1986) decomposes it into a unique representation $\mathcal{F} \in \mathbb{C}^{C \times H \times W}$ in the frequency domain as follows:

$$\mathcal{F}_{u,v} = \sum_{h=0}^{H-1} \sum_{w=0}^{W-1} F_{h,w} \, e^{-j 2\pi \left( u \frac{h}{H} + v \frac{w}{W} \right)}, \qquad j^2 = -1,$$

where $(h, w)$ and $(u, v)$ are the indices on the spatial dimensions and the frequency dimensions, respectively. Since images and feature maps consist of multiple channels, we here apply the Fourier

