WAGI: WAVELET-BASED GAN INVERSION FOR PRESERVING HIGH-FREQUENCY IMAGE DETAILS

Anonymous

Abstract

Recent GAN inversion models focus on preserving image-specific details through various methods, e.g., generator tuning or feature mixing. While these methods help preserve details compared to a naïve low-rate latent inversion, they still fail to maintain high-frequency features precisely. In this paper, we point out that existing GAN inversion models have inherent limitations in both structural and training aspects, which preclude the delicate reconstruction of high-frequency features. In particular, we prove that the widely used loss term in GAN inversion, i.e., L2, is biased toward reconstructing mainly low-frequency features. To overcome this problem, we propose a novel GAN inversion model, coined WaGI, which handles high-frequency features explicitly by using a novel wavelet-based loss term and a newly proposed wavelet fusion scheme. To the best of our knowledge, WaGI is the first attempt to interpret GAN inversion in the frequency domain. We demonstrate that WaGI shows outstanding results on both inversion and editing compared to existing state-of-the-art GAN inversion models. In particular, WaGI robustly preserves high-frequency features of images even in the editing scenario. We will release our code with the pre-trained model after the review.

1. INTRODUCTION

Recently, the inversion of Generative Adversarial Networks (GANs) (Goodfellow et al., 2020) has improved dramatically by using the prior knowledge of powerful unconditional generators (Karras et al., 2019; 2020; 2021) for robust and disentangled image attribute editing (Abdal et al., 2019; Richardson et al., 2021; Tov et al., 2021; Alaluf et al., 2021a;b; Wang et al., 2022a). Early GAN inversion models mostly rely on per-image optimization (Abdal et al., 2019; 2020; Zhu et al., 2020), which is extremely time-consuming. For real-time inference, encoder-based GAN inversion has become prevalent (Richardson et al., 2021; Tov et al., 2021; Alaluf et al., 2021a; Moon & Park, 2022): an encoder is trained to return the GAN latent corresponding to an input image. The latent obtained from the encoder should reproduce the input image as closely as possible. However, the encoder must compress the image into a small dimension, i.e., low-rate inversion. For instance, in the case of StyleGAN2 (Karras et al., 2020), for an image of size 1024² × 3 the encoders return a corresponding latent of size 18 × 512, which is far smaller than the original image dimension (about 0.3%). Following the Information Bottleneck theory (Tishby & Zaslavsky, 2015; Wang et al., 2022a), encoding information into such a small tensor incurs severe information loss, which deteriorates image details, i.e., high-frequency features. To overcome this limitation, recent GAN inversion models propose new directions, such as fine-tuning the generator (Roich et al., 2021; Alaluf et al., 2021b) or directly manipulating intermediate features of the generative model (Wang et al., 2022a), delivering more information through higher-dimensional features than latents, i.e., high-rate inversion. However, the results of high-rate inversion models are still imperfect.
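The compression ratio quoted above follows directly from the StyleGAN2 dimensions in the text; a quick calculation makes it explicit:

```python
# Dimensions from the text: a 1024 x 1024 RGB image vs. an 18 x 512 latent.
image_dim = 1024 * 1024 * 3   # 3,145,728 values per image
latent_dim = 18 * 512         # 9,216 values per latent
ratio = latent_dim / image_dim
print(f"latent/image = {ratio:.2%}")  # about 0.29%, i.e., roughly 0.3%
```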
Figure 1 shows the inversion results of the high-rate inversion models HyperStyle (Alaluf et al., 2021b) and HFGI (Wang et al., 2022a). Though both models generally preserve coarse features, the details are distorted, e.g., the boundaries of the letters and the shapes of accessories. The aforementioned high-rate inversion models remarkably decrease distortion compared to state-of-the-art low-rate inversion models, e.g., ReStyle (Alaluf et al., 2021a). However, this does not mean that distortion is decreased evenly across the frequency spectrum. To explicitly check distortion
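One way to check distortion per frequency band is to decompose the reconstruction residual with an orthonormal wavelet transform: since the transform preserves energy (Parseval), the total L2 error splits exactly into per-subband contributions, and each term shows how much distortion lies in that band. The sketch below is a minimal NumPy illustration of this idea with a one-level Haar transform and random stand-in data, not the paper's actual measurement procedure; the function name is ours.

```python
import numpy as np

def haar2d(x):
    """One-level orthonormal 2D Haar transform -> (LL, LH, HL, HH) subbands."""
    a, b = x[0::2, 0::2], x[0::2, 1::2]
    c, d = x[1::2, 0::2], x[1::2, 1::2]
    ll = (a + b + c + d) / 2  # low-frequency approximation
    lh = (a - b + c - d) / 2  # horizontal detail
    hl = (a + b - c - d) / 2  # vertical detail
    hh = (a - b - c + d) / 2  # diagonal detail
    return ll, lh, hl, hh

rng = np.random.default_rng(0)
img = rng.standard_normal((64, 64))                 # stand-in for a real image
recon = img + 0.1 * rng.standard_normal((64, 64))   # stand-in reconstruction

residual = img - recon
subbands = haar2d(residual)
per_band = [float(np.sum(s ** 2)) for s in subbands]
total = float(np.sum(residual ** 2))
# Energy preservation: the pixel-domain squared L2 error equals the sum of
# squared subband errors, so per_band attributes distortion to each band.
assert np.isclose(sum(per_band), total)
```

Because the decomposition is exact, comparing `per_band` across inversion models reveals whether a reduction in overall L2 distortion actually reaches the high-frequency (LH/HL/HH) bands or is concentrated in the LL band.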

