WAGI: WAVELET-BASED GAN INVERSION FOR PRESERVING HIGH-FREQUENCY IMAGE DETAILS

Anonymous

Abstract

Recent GAN inversion models focus on preserving image-specific details through various methods, e.g., generator tuning or feature mixing. While these are helpful for preserving details compared to naïve low-rate latent inversion, they still fail to maintain high-frequency features precisely. In this paper, we point out that existing GAN inversion models have inherent limitations in both structural and training aspects, which preclude the delicate reconstruction of high-frequency features. In particular, we prove that the widely used loss term in GAN inversion, i.e., L2, is biased toward reconstructing low-frequency features. To overcome this problem, we propose a novel GAN inversion model, coined WaGI, which explicitly handles high-frequency features via a novel wavelet-based loss term and a newly proposed wavelet fusion scheme. To the best of our knowledge, WaGI is the first attempt to interpret GAN inversion in the frequency domain. We demonstrate that WaGI shows outstanding results on both inversion and editing, compared to the existing state-of-the-art GAN inversion models. In particular, WaGI robustly preserves high-frequency features of images even in the editing scenario. We will release our code with the pre-trained model after the review.

1. INTRODUCTION

Recently, the inversion of Generative Adversarial Networks (GANs) (Goodfellow et al., 2020) has improved dramatically by using the prior knowledge of powerful unconditional generators (Karras et al., 2019; 2020; 2021) for robust and disentangled image attribute editing (Abdal et al., 2019; Richardson et al., 2021; Tov et al., 2021; Alaluf et al., 2021a;b; Wang et al., 2022a). The early GAN inversion models mostly rely on per-image optimization (Abdal et al., 2019; 2020; Zhu et al., 2020), which is extremely time-consuming. For real-time inference, encoder-based GAN inversion methods have become prevalent (Richardson et al., 2021; Tov et al., 2021; Alaluf et al., 2021a; Moon & Park, 2022), which train an encoder that returns the corresponding GAN latent of an input image. The latent acquired from the encoder should reproduce the input image as closely as possible. However, the encoder needs to compress the image into a small dimension, i.e., low-rate inversion. For instance, in the case of StyleGAN2 (Karras et al., 2020), for an image of size 1024² × 3 the encoders return a corresponding latent of size 18 × 512, which is far smaller than the original image dimension (about 0.3%). By the Information Bottleneck theory (Tishby & Zaslavsky, 2015; Wang et al., 2022a), an attempt to encode information into a small tensor incurs severe information loss, which deteriorates image details, i.e., high-frequency features. To overcome this limitation, recent GAN inversion models propose new directions, such as fine-tuning the generator (Roich et al., 2021; Alaluf et al., 2021b) or directly manipulating the intermediate features of the generative model (Wang et al., 2022a) to deliver more information using higher-dimensional features than latents, i.e., high-rate inversion. However, the results of high-rate inversion models are still imperfect.
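The roughly 0.3% figure quoted above follows directly from the stated dimensions; a quick sanity check:

```python
# Latent vs. image dimensionality for StyleGAN2 at 1024 x 1024:
# a W+ latent (18 style vectors of size 512) versus an RGB image tensor.
latent_dim = 18 * 512            # elements in the W+ latent
image_dim = 1024 * 1024 * 3      # elements in the image tensor
ratio = latent_dim / image_dim
print(f"latent/image = {ratio:.4%}")  # ~0.29%, i.e., "about 0.3%"
```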
Figure 1 shows the inversion results of the high-rate inversion models HyperStyle (Alaluf et al., 2021b) and HFGI (Wang et al., 2022a). Though both models generally preserve coarse features, the details are distorted, e.g., the boundaries of the letter and the shape of the accessories. The aforementioned high-rate inversion models remarkably decrease distortion compared to the state-of-the-art low-rate inversion model, i.e., Restyle (Alaluf et al., 2021a). However, this does not mean that distortion is decreased evenly across the frequency spectrum. To explicitly check distortion per frequency sub-band, we adopt the wavelet transform, which enables the use of both frequency and spatial information. The wavelet transform yields a total of four coefficients by passing the image through filters, which form a low-pass filter set F_l = {LL} and a high-pass filter set F_h = {LH, HL, HH}. In Figure 2a, we visualize the coefficients obtained by each filter. In Figure 2b, we compare the L2 distance between the coefficients of ground-truth images and inverted images yielded by filter f, i.e., L2,f. While the high-rate inversion models apparently decrease L2,f for f ∈ F_l, they only marginally decrease or even increase L2,f for f ∈ F_h, compared to Restyle. In light of this observation, we argue that the existing methods of increasing the information rate can decrease distortion on the low-frequency sub-band, but are not effective at decreasing distortion on the high-frequency sub-bands.

Contributions. First, we use the wavelet transform to prove that the widely used loss term in GAN inversion, i.e., L2, is biased toward low frequencies. Then, we propose a simple wavelet-based GAN inversion model, named WaGI, which effectively lowers distortion on both the low-frequency and high-frequency sub-bands. To the best of our knowledge, WaGI is the first attempt to interpret GAN inversion in the frequency domain. In particular, we propose two novel components for our model: (i) a wavelet loss and (ii) wavelet fusion.
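The sub-band analysis above can be made concrete with a minimal NumPy sketch (not from the paper): a single-level Haar wavelet transform splitting an image into the {LL, LH, HL, HH} coefficients, and the per-sub-band distortion L2,f between a ground-truth image and its reconstruction. Sign conventions for the LH/HL bands vary between libraries; this is one common choice.

```python
import numpy as np

def haar_dwt2(img):
    """Single-level 2-D Haar wavelet transform.

    Splits an H x W array (H, W even) into the four sub-band
    coefficient maps {LL, LH, HL, HH}: LL is the low-pass band,
    the other three carry the high-frequency details.
    """
    a = img[0::2, 0::2]  # top-left pixel of each 2x2 block
    b = img[0::2, 1::2]  # top-right
    c = img[1::2, 0::2]  # bottom-left
    d = img[1::2, 1::2]  # bottom-right
    return {
        "LL": (a + b + c + d) / 2.0,  # low-pass in both directions
        "LH": (a + b - c - d) / 2.0,  # high-pass across rows
        "HL": (a - b + c - d) / 2.0,  # high-pass across columns
        "HH": (a - b - c + d) / 2.0,  # high-pass in both directions
    }

def subband_l2(x, y):
    """Per-sub-band distortion L2,f between a ground-truth image x
    and a reconstruction y, for each filter f in {LL, LH, HL, HH}."""
    wx, wy = haar_dwt2(x), haar_dwt2(y)
    return {f: float(np.mean((wx[f] - wy[f]) ** 2)) for f in wx}
```

For example, a checkerboard perturbation of an image is invisible to L2,LL but shows up entirely in L2,HH, which is exactly the kind of error a plain pixel-space L2 under-weights.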
First, (i) amplifies the loss on the high-frequency sub-bands by using the wavelet coefficients from f ∈ F_h. By training with (i), WaGI becomes proficient at reconstructing high-frequency details. Second, (ii) transfers the high-frequency features of the input directly into the wavelet coefficients of the reconstructed image. Owing to the wavelet upsampling structure of the SWAGAN generator (Gal et al., 2021), we can explicitly manipulate the wavelet coefficients during the hierarchical upsampling. We demonstrate that WaGI shows outstanding results compared to the existing state-of-the-art GAN inversion models (Alaluf et al., 2021b; Wang et al., 2022a). We achieve the lowest distortion among the existing GAN inversion models in the inversion scenario. Moreover, qualitative results show that our model robustly preserves image-wise details in both the inversion and editing scenarios, via InterFaceGAN (Shen et al., 2020) and StyleCLIP (Patashnik et al., 2021). Finally, we present ablation results and show that each of our proposed components is indeed effective.
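The idea behind (i) can be sketched as follows. This is our own illustrative, self-contained version, not the paper's exact formulation: a plain pixel-space L2 term plus an extra penalty on the three high-pass Haar sub-bands, so that high-frequency errors are amplified rather than averaged away. The weight `lam` is a hypothetical hyperparameter.

```python
import numpy as np

def _haar_subbands(img):
    # Single-level Haar transform; returns (LL, LH, HL, HH) coefficient maps.
    a, b = img[0::2, 0::2], img[0::2, 1::2]
    c, d = img[1::2, 0::2], img[1::2, 1::2]
    return ((a + b + c + d) / 2.0, (a + b - c - d) / 2.0,
            (a - b + c - d) / 2.0, (a - b - c + d) / 2.0)

def wavelet_loss(x, y, lam=2.0):
    """Sketch of a wavelet loss: pixel-space L2 plus a weighted L2 on
    the high-pass sub-bands f in F_h = {LH, HL, HH}. `lam` and the
    exact combination are assumptions for illustration."""
    (_, lhx, hlx, hhx) = _haar_subbands(x)
    (_, lhy, hly, hhy) = _haar_subbands(y)
    l2 = np.mean((x - y) ** 2)
    hf = sum(np.mean((p - q) ** 2)
             for p, q in ((lhx, lhy), (hlx, hly), (hhx, hhy)))
    return float(l2 + lam * hf)
```

Under this loss, a high-frequency (e.g., checkerboard) reconstruction error is penalized more heavily than a constant offset of equal pixel-space L2, which is the behavior the training objective needs in order to stop being biased toward low frequencies.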



Figure 1: Preserving details at image inversion. Comparison of inversion results for a noisy image. The zoomed parts are regions that require delicate preservation of details. Existing GAN inversion models, including recent high-rate inversion methods such as generator tuning, e.g., HyperStyle, and feature mixing, e.g., HFGI, still struggle to restore high-frequency details.

