WAGI: WAVELET-BASED GAN INVERSION FOR PRESERVING HIGH-FREQUENCY IMAGE DETAILS

Anonymous

Abstract

Recent GAN inversion models focus on preserving image-specific details through various methods, e.g., generator tuning or feature mixing. While these methods preserve details better than naïve low-rate latent inversion, they still fail to maintain high-frequency features precisely. In this paper, we point out that existing GAN inversion models have inherent limitations in both structural and training aspects, which preclude the delicate reconstruction of high-frequency features. In particular, we prove that the widely used loss term in GAN inversion, i.e., L_2, is biased toward reconstructing low-frequency features. To overcome this problem, we propose a novel GAN inversion model, coined WaGI, which handles high-frequency features explicitly through a novel wavelet-based loss term and a newly proposed wavelet fusion scheme. To the best of our knowledge, WaGI is the first attempt to interpret GAN inversion in the frequency domain. We demonstrate that WaGI shows outstanding results on both inversion and editing, compared to the existing state-of-the-art GAN inversion models. In particular, WaGI robustly preserves high-frequency features of images even in the editing scenario. We will release our code with the pre-trained model after the review.

1. INTRODUCTION

Recently, the inversion of Generative Adversarial Networks (GANs) (Goodfellow et al., 2020) has dramatically improved by using the prior knowledge of powerful unconditional generators (Karras et al., 2019; 2020; 2021) for robust and disentangled image attribute editing (Abdal et al., 2019; Richardson et al., 2021; Tov et al., 2021; Alaluf et al., 2021a; b; Wang et al., 2022a). The early GAN inversion models mostly rely on per-image optimization (Abdal et al., 2019; 2020; Zhu et al., 2020), which is extremely time-consuming. For real-time inference, encoder-based GAN inversion has become prevalent (Richardson et al., 2021; Tov et al., 2021; Alaluf et al., 2021a; Moon & Park, 2022): an encoder is trained to return the GAN latent corresponding to an input image, and the acquired latent should reproduce the input image as closely as possible. However, the encoder must compress the image into a small dimension, i.e., low-rate inversion. For instance, in the case of StyleGAN2 (Karras et al., 2020), to encode an image of size 1024×1024×3, the encoders return a corresponding latent of size 18×512, which is far smaller than the original image dimension (about 0.3%). By the Information Bottleneck theory (Tishby & Zaslavsky, 2015; Wang et al., 2022a), an attempt to encode information into such a small tensor incurs severe information loss, which deteriorates the image details, i.e., high-frequency features. To overcome this shortcoming, recent GAN inversion models propose new directions, such as finetuning the generator (Roich et al., 2021; Alaluf et al., 2021b) or directly manipulating the intermediate features of the generative model (Wang et al., 2022a), delivering more information through higher-dimensional features than latents, i.e., high-rate inversion. However, the results of high-rate inversion models are still imperfect.
Figure 1 shows the inversion results of the high-rate inversion models HyperStyle (Alaluf et al., 2021b) and HFGI (Wang et al., 2022a). Though both models generally preserve coarse features, the details are distorted, e.g., the boundaries of the letters and the shapes of accessories. The aforementioned high-rate inversion models remarkably decrease distortion compared to the state-of-the-art low-rate inversion model, i.e., Restyle (Alaluf et al., 2021a). However, this does not mean that distortion is evenly decreased over the entire frequency spectrum. To explicitly measure distortion per frequency sub-band, we adopt the wavelet transform, which provides both frequency and spatial information. The wavelet transform yields a total of four coefficient maps by passing filters, grouped into a low-pass filter set F_l = {LL} and a high-pass filter set F_h = {LH, HL, HH}. In Figure 2a, we visualize the coefficients obtained by each filter. In Figure 2b, we compare the L_2 distance between the coefficients of ground-truth images and inverted images yielded by filter f, i.e., L_2,f. While the high-rate inversion models apparently decrease L_2,f for f ∈ F_l, they only marginally decrease, or even increase, L_2,f for f ∈ F_h compared to Restyle. In light of this observation, we argue that the existing methods of increasing the information rate decrease distortion on the low-frequency sub-band, but are not effective at decreasing distortion on the high-frequency sub-band.

Contributions. First, we prove via the wavelet transform that the widely used loss term in GAN inversion, i.e., L_2, is biased toward the low-frequency sub-band. Then, we propose a simple wavelet-based GAN inversion model, named WaGI, which effectively lowers distortion on both the low-frequency and high-frequency sub-bands. To the best of our knowledge, WaGI is the first attempt to interpret GAN inversion in the frequency domain. In particular, we propose two novel components for our model: (i) wavelet loss and (ii) wavelet fusion.
First, (i) amplifies the loss on the high-frequency sub-band using the wavelet coefficients from f ∈ F_h. By using (i) during training, WaGI becomes proficient in reconstructing high-frequency details. Second, (ii) transfers the high-frequency features directly into the wavelet coefficients of the reconstructed image. Owing to the wavelet upsampling structure of SWAGAN, we can explicitly manipulate the wavelet coefficients during hierarchical upsampling. We demonstrate that WaGI shows outstanding results compared to the existing state-of-the-art GAN inversion models (Alaluf et al., 2021b; Wang et al., 2022a). We achieve the lowest distortion among existing GAN inversion models in the inversion scenario. Moreover, qualitative results show that our model robustly preserves image-wise details in both the inversion and editing scenarios, via InterFaceGAN (Shen et al., 2020) and StyleCLIP (Patashnik et al., 2021). Finally, we present elaborate ablation results and show that each of our proposed methods is indeed effective.

2. RELATED WORK

2.1. WAVELET TRANSFORM

The wavelet transform provides information on both frequency and spatial location (Daubechies, 1990), which is crucial in the image domain. The most widely used wavelet transform in deep learning-based image processing is the Haar wavelet transform, which consists of the following four simple filters:

LL = [[1, 1], [1, 1]], LH = [[-1, -1], [1, 1]], HL = [[-1, 1], [-1, 1]], HH = [[1, -1], [-1, 1]]. (1)

Since the Haar wavelet transform enables reconstruction without information loss via the inverse wavelet transform (Yoo et al., 2019), it is widely used in image reconstruction-related tasks, e.g., super-resolution (Huang et al., 2017; Liu et al., 2018) and photo-realistic style transfer (Yoo et al., 2019). In GAN inversion, image-wise details that cannot be generated by an unconditional generator, e.g., StyleGAN (Karras et al., 2019; 2020), should be transferred without information loss. Consequently, we argue that the Haar wavelet transform is appropriate for GAN inversion. To the best of our knowledge, our method is the first approach to combine the wavelet transform with GAN inversion.
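As a concrete illustration, the forward and inverse Haar transforms can be sketched in a few lines of NumPy. The 1/2 scaling of the filters is our added assumption: it makes the four filters an orthonormal basis of each 2×2 pixel block, so the inverse transform reconstructs the image exactly. The function names are illustrative, not from a released implementation.

```python
import numpy as np

# The four Haar filters from equation (1), scaled by 1/2 for orthonormality
# (our assumption; the scaling only changes coefficients by a constant factor).
FILTERS = {
    "LL": 0.5 * np.array([[1.0, 1.0], [1.0, 1.0]]),
    "LH": 0.5 * np.array([[-1.0, -1.0], [1.0, 1.0]]),
    "HL": 0.5 * np.array([[-1.0, 1.0], [-1.0, 1.0]]),
    "HH": 0.5 * np.array([[1.0, -1.0], [-1.0, 1.0]]),
}

def haar_dwt(img):
    """(H, W) image -> four (H/2, W/2) coefficient maps (stride-2 filtering)."""
    h, w = img.shape
    blocks = img.reshape(h // 2, 2, w // 2, 2)  # non-overlapping 2x2 blocks
    return {name: np.einsum("ipjq,pq->ij", blocks, f) for name, f in FILTERS.items()}

def haar_idwt(coeffs):
    """Exact inverse: each 2x2 block is the coefficient-weighted sum of filters."""
    ll = coeffs["LL"]
    h, w = ll.shape
    out = np.zeros((h, 2, w, 2))
    for name, f in FILTERS.items():
        out += coeffs[name][:, None, :, None] * f[None, :, None, :]
    return out.reshape(2 * h, 2 * w)
```

Round-tripping an image through `haar_dwt` and `haar_idwt` returns it unchanged, which is the lossless-reconstruction property the paragraph above relies on.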

2.2. FREQUENCY BIAS OF GENERATIVE MODELS

Recent works (Dzanic et al., 2019; Liu et al., 2020; Wang et al., 2020) tackle the spectral bias of GANs: GAN training is biased toward learning the low-frequency distribution while struggling to learn the high-frequency counterpart. GANs suffer in generating high-frequency features due to incompatible upsampling operations in the pixel domain (Schwarz et al., 2021; Frank et al., 2020; Khayatkhoei & Elgammal, 2022). To alleviate the spectral distortions, prior works (Schwarz et al., 2021; Durall et al., 2020; Jiang et al., 2021) propose spectral regularization loss terms to match the high-frequency distribution. However, we empirically find that images generated with a spectral loss can contain undesirable high-frequency noise that coercively matches the spectral density while eventually degrading image quality (see Appendix A.2.1). Currently, SWAGAN (Gal et al., 2021) is the first and only StyleGAN2-based architecture that generates images directly in the frequency domain to create fine-grained content in the high-frequency range. Its hierarchical growth of predicted wavelet coefficients contributes to an accurate approximation of the spectral density while retaining realistic visual quality in the spatial domain. We adopt this wavelet-based architecture to preserve high-frequency information and mitigate spectral bias.

2.3. HIGH-RATE GAN INVERSION

The most popular high-rate GAN inversion methods are generator tuning and intermediate feature fusion. Generator tuning was first proposed in Pivotal Tuning (Roich et al., 2021), which finetunes the generator to lower distortion for input images. Since Pivotal Tuning requires extra training for every new input, it is extremely time-consuming. Recently, HyperStyle (Alaluf et al., 2021b) enabled generator tuning without per-image training by using a HyperNetwork (Ha et al., 2016). Though HyperStyle effectively lowers distortion compared to low-rate inversion methods, it still suffers from frequency bias. Second, intermediate feature fusion was proposed in HFGI (Wang et al., 2022a). HFGI computes the information missing from low-rate inversion and extracts feature vectors via a consultant encoder. The extracted feature is fused with the original StyleGAN feature to reflect image-specific details in the decoding process. However, HFGI relies on the low-frequency-biased loss term even when training the image-specific delivery module, which leads to frequency bias.

3. METHOD

In this section, we propose an effective wavelet-based GAN inversion method, named WaGI. In Section 3.1, we first introduce the notation and the architecture of WaGI. In Section 3.2, we prove the low-frequency bias of L_2 using the Haar wavelet transform and, to alleviate this bias, propose a novel loss term named wavelet loss. Lastly, we point out the limitation of existing feature fusion and propose wavelet fusion, which robustly transfers high-frequency features.

Figure 3: Overview of WaGI. From the aligned ∆̂, we replenish the missing high-fidelity information with the two fusion modules F_feat and F_wave. Fusion with each output is operated in the feature and spatial frequency domains in separate intermediate layers, between the lower layers G_0^L and higher layers G_0^H of G_0, respectively. The final inversion result X̂ contains rich information without the loss of high-frequency components. Note that ADA, F_feat, and F_wave are jointly trained, while E_0 and G_0 are frozen.

3.1. NOTATION AND ARCHITECTURE

Figure 3 shows our overall architecture. Since the goal of WaGI is to retain the high-frequency details of an image X, we design our model in two stages: naïve inversion and image-wise detail addition. First, the pre-trained SWAGAN generator G_0 and its pre-trained encoder E_0 yield the low-rate latent w = E_0(X) and the corresponding naïve inversion X̂_0 = G_0(w). Due to the inherent limitation of low-rate inversion, X̂_0 misses image-wise details, denoted by ∆ = X − X̂_0. Since an edited image X̂_0^edit may have a distorted alignment, e.g., a changed face angle or eye location, ∆ should adapt to the alignment of X̂_0^edit in the editing scenario. Hence, we train the Adaptive Distortion Alignment (ADA) module (Wang et al., 2022a), which re-aligns ∆ to fit the alignment of X̂_0^edit. During training, due to the absence of edited images preserving image-wise details, we impose a self-supervised task by deliberately misaligning ∆ with a random distortion map, ∆' = RandAugment(∆) (Wang et al., 2022a). The purpose of ADA is to minimize the discrepancy between the output ∆̂ = ADA(X̂_0, ∆') and ∆. WaGI combines ∆̂ with the low-rate latent w through feature fusion F_feat (Wang et al., 2022a) and our proposed wavelet fusion F_wave. Both F_feat and F_wave follow a linear gated scheme to filter out redundant information, returning (scale, shift) pairs, i.e., (g_feat, h_feat) = F_feat(∆̂) and (g_wave, h_wave) = F_wave(∆̂), respectively. The pairs g and h adaptively merge the desired information from w and ∆̂. Refer to Section 3.3 for the detailed fusion processes.
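The two-stage pipeline above can be summarized with the following sketch. Every argument is a stand-in callable for the corresponding network (E_0, G_0, ADA, F_feat, F_wave); the keyword arguments accepted by `g0` are hypothetical hooks for the two fusion modules, not a released API.

```python
# High-level sketch of the WaGI inversion pass described in Section 3.1.
# All callables are stand-ins; `feat_mod`/`wave_mod` are hypothetical hooks.
def wagi_invert(x, e0, g0, ada, f_feat, f_wave):
    w = e0(x)                      # low-rate latent w = E_0(X)
    x0 = g0(w)                     # naive inversion X_0 = G_0(w)
    delta = x - x0                 # missing image-wise details
    delta_hat = ada(x0, delta)     # re-aligned residual from ADA
    g_f, h_f = f_feat(delta_hat)   # feature-fusion (scale, shift)
    g_w, h_w = f_wave(delta_hat)   # wavelet-fusion (scale, shift)
    return g0(w, feat_mod=(g_f, h_f), wave_mod=(g_w, h_w))
```

At editing time the same pass applies, except that `w` is replaced by the edited latent and ADA re-aligns ∆ to the edited layout.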
The overall loss merges the self-supervised alignment loss L_ADA, the wavelet loss L^K_wave between X and the final image X̂, and the reconstruction loss L_image, consisting of the weighted sum of L_2, LPIPS (Zhang et al., 2018), and L_id with weights λ_L2, λ_LPIPS, and λ_id, respectively:

L_total = L_ADA(∆̂, ∆) + L_image(X, X̂) + λ_wave L^K_wave(X, X̂). (2)

3.2. WAVELET LOSS

The majority of GAN inversion models use L_image with weights λ_L2 > λ_LPIPS > λ_id for comparing the ground truth and the reconstructed image (Richardson et al., 2021; Tov et al., 2021; Alaluf et al., 2021a; Roich et al., 2021; Alaluf et al., 2021b; Wang et al., 2022a). Combining the following theorem with our observation, we demonstrate that L_2, which has the highest weight in L_image, is biased toward the low-frequency sub-band in terms of the Haar wavelet transform:

Theorem 1. The following equation holds when λ_f = 1, ∀f ∈ F_l ∪ F_h:

L_2(I_1, I_2) = Σ_{f ∈ F_l ∪ F_h} λ_f L_2,f(I_1, I_2), (3)

where I_1 and I_2 are arbitrary image tensors. Proof is in Appendix A.1.

We can derive the following simple lemma from Theorem 1.

Lemma 1.1. When the pixel-wise differences between I_1 and I_2 are i.i.d. and follow N(µ, σ²) with µ ≈ 0, the following equation holds when λ_f = 1/4, ∀f ∈ F_l ∪ F_h:

log E[L_1(I_1, I_2)] + C = Σ_{f ∈ F_l ∪ F_h} λ_f log E[L_1,f(I_1, I_2)], (4)

where C is a constant. Proof is in Appendix A.1.

In Theorem 1, since L_2 reflects every L_2,f, f ∈ F_l ∪ F_h, with the same weight, it seems a fair loss without frequency bias. However, as shown in Figure 2b, we empirically find that L_2,LL of the existing GAN inversion models, e.g., Restyle (Alaluf et al., 2021a), HyperStyle (Alaluf et al., 2021b), and HFGI (Wang et al., 2022a), is around 30 times larger than L_2,f for f ∈ F_h on average. The same logic applies to L_1, which shows an extreme low-frequency bias while using the same λ_f, as shown in Lemma 1.1 (refer to Appendix A.1 for detailed descriptions). Consequently, we argue that λ_f for f ∈ F_h should be higher than λ_LL to avoid the low-frequency bias. In light of this observation, we design a novel loss term, named wavelet loss, to focus on high-frequency details.
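Theorem 1 can also be checked numerically. The sketch below assumes the Haar filters of equation (1) scaled by 1/2 and per-band losses normalized by the original pixel count; under this convention, the plain L_2 distance equals the equally weighted sum of the four per-band distances.

```python
import numpy as np

# Numerical check of Theorem 1 under our assumed normalization:
# filters scaled by 1/2, per-band losses averaged over the ORIGINAL mn pixels.
FILTERS = [0.5 * np.array(m, dtype=float) for m in (
    [[1, 1], [1, 1]], [[-1, -1], [1, 1]],
    [[-1, 1], [-1, 1]], [[1, -1], [-1, 1]])]

def band_l2(i1, i2, f):
    """L_{2,f}: squared stride-2 filter responses, averaged over mn pixels."""
    def coef(img):
        h, w = img.shape
        return np.einsum("ipjq,pq->ij", img.reshape(h // 2, 2, w // 2, 2), f)
    d = coef(i1) - coef(i2)
    return np.sum(d ** 2) / i1.size

rng = np.random.default_rng(0)
a, b = rng.standard_normal((2, 16, 16))
plain_l2 = np.mean((a - b) ** 2)
assert np.isclose(plain_l2, sum(band_l2(a, b, f) for f in FILTERS))
```

The equality holds for any pair of images, since the four scaled filters form an orthonormal basis of each 2×2 pixel block.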
The wavelet loss L_wave between I_1 and I_2 is defined as:

L_wave(I_1, I_2) = Σ_{f ∈ F_h} L_2,f(I_1, I_2). (5)

Since I_1 and I_2 only pass through f ∈ F_h in equation 5, this loss discards the sub-bands with frequency below f_nyq/2, where f_nyq is the Nyquist frequency of the image. However, we empirically find that a substantial amount of image detail lies below f_nyq/2; in other words, L_wave should cover a broader range of frequency bands. To this end, we combine L_wave with multi-level wavelet decomposition (Liu et al., 2018; Yoo et al., 2019), which subdivides the frequency ranges by iteratively applying the four filters to the LL sub-band. The improved loss with K-level wavelet decomposition, named L^K_wave, is:

L^K_wave(I_1, I_2) = Σ_{i=0}^{K} Σ_{f ∈ F_h} L_2,f(I_1 * LL^(i), I_2 * LL^(i)), (6)

where LL^(i) denotes applying the LL filter i times for multi-level wavelet decomposition. L^K_wave covers the image sub-bands with frequencies between f_nyq/2^(K+1) and f_nyq. We find that L^K_wave is especially helpful for training modules that address high-frequency details, e.g., ADA. The existing ADA in HFGI uses L_1(∆̂, ∆), but we find this loss term insufficient for focusing on high-frequency features. Since ADA aims to re-align high-frequency details to fit the alignment of the edited image, high-frequency details must be regarded for both purposes: fitting the alignment and preserving the details. Consequently, we add the wavelet loss to train ADA:

L_ADA(I_1, I_2) = L_1(I_1, I_2) + λ_wave,ADA L^K_wave(I_1, I_2). (7)

We set λ_wave,ADA = 0.1 and K = 2 for training.
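A minimal NumPy sketch of L^K_wave under the same assumptions as before (1/2-scaled Haar filters, mean-squared per-band distance); the function names are illustrative, not from our released code.

```python
import numpy as np

# Multi-level wavelet loss sketch: sum the three high-pass L2 terms at each
# of K+1 decomposition levels, descending via the LL filter between levels.
LL = 0.5 * np.array([[1.0, 1.0], [1.0, 1.0]])
HIGH_PASS = [0.5 * np.array(m) for m in ([[-1.0, -1.0], [1.0, 1.0]],
                                         [[-1.0, 1.0], [-1.0, 1.0]],
                                         [[1.0, -1.0], [-1.0, 1.0]])]

def _filt(img, f):
    h, w = img.shape
    return np.einsum("ipjq,pq->ij", img.reshape(h // 2, 2, w // 2, 2), f)

def wavelet_loss(i1, i2, K=2):
    """sum_{i=0..K} sum_{f in F_h} L2 between f responses of the LL^(i) pyramids."""
    loss = 0.0
    for _ in range(K + 1):
        for f in HIGH_PASS:
            d = _filt(i1, f) - _filt(i2, f)
            loss += np.mean(d ** 2)
        i1, i2 = _filt(i1, LL), _filt(i2, LL)  # descend to the next LL level
    return loss
```

Because every f ∈ F_h has zero mean, a constant (purely low-frequency) difference between two images contributes nothing to this loss, which is precisely the intended focus on the high-frequency sub-bands.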

3.3. WAVELET FUSION

To prevent the generator from relying solely on the low-rate latent w ∈ W+, we should delicately transfer the information of ∆̂ to the generator. For instance, HFGI (Wang et al., 2022a) extracts a scale g_feat and a shift h_feat from ∆̂ and fuses them with F_ℓf, the original StyleGAN intermediate feature at layer ℓ_f, and the latent w_ℓf at layer ℓ_f, as follows:

F_ℓf+1 = g_feat ⊙ ModConv(F_ℓf, w_ℓf) + h_feat. (8)

Though feature fusion helps preserve image-specific details, the majority of image details, e.g., exact boundaries, still collapse (see Section 4 for more details). We attribute this to the low resolution of feature fusion. In the case of HFGI, feature fusion is only applied at resolution 64, i.e., ℓ_f = 7. According to the Shannon-Nyquist theorem, a square image I ∈ R^{H×W} with H = W = l cannot store information with a frequency range higher than f_nyq = √(H² + W²) = l√2. Consequently, the upper bound on the frequency of information delivered by the feature fusion of HFGI is f_nyq = 64√2, which is far lower than the image size, 1024. To address this, we modify the feature fusion to operate at both the 64 and 128 resolutions, i.e., ℓ_f = 7 and 9, respectively, which doubles f_nyq. However, a simple resolution increment cannot solve the problem entirely (see Section 4.3 for more details). Since feature fusion passes through the pre-trained convolutions of the generator, degradation of image-specific details is inevitable. To avoid this degradation, we propose a novel method, named wavelet fusion. Wavelet fusion directly manipulates the wavelet coefficients instead of the generator feature. Using the hierarchical upsampling structure of SWAGAN, which explicitly constructs the wavelet coefficients, wavelet fusion transfers high-frequency knowledge without degradation.
Similar to feature fusion, our model obtains a scale g_wave and a shift h_wave from ∆̂ and fuses them with the wavelet coefficients produced by the tWavelets layer as follows:

W_ℓw = g_wave ⊙ tWavelets(F_ℓw) + h_wave, (9)

where W_ℓw are the wavelet coefficients at the ℓ_w-th layer. We empirically find that feature fusion helps reconstruct the coarse shape, while wavelet fusion helps reconstruct the fine-grained details. Consequently, we apply feature fusion at earlier layers (ℓ_f = 7 and 9) than wavelet fusion (ℓ_w = 11).
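The shared (scale, shift) gating rule can be sketched as follows. The conditioning branch that maps the aligned residual ∆̂ to `scale_logits` and `shift` is omitted, so the arguments stand in for its outputs; the sigmoid gate is our assumption for how the linear gated scheme "filters out redundant information".

```python
import numpy as np

# Minimal sketch of the gated (scale, shift) fusion rule shared by
# F_feat and F_wave: W = g * tWavelets(F) + h, applied element-wise.
def gated_fusion(coeffs, scale_logits, shift):
    g = 1.0 / (1.0 + np.exp(-scale_logits))  # gate g in (0, 1)
    return g * coeffs + shift
```

With strongly negative logits the gate closes and the shift dominates, letting the residual branch overwrite redundant generator content; with zero shift and open gates the original coefficients pass through scaled.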

4. EXPERIMENTS

In this section, we compare the results of WaGI with various GAN inversion methods. As baselines, we used the widely used low-rate inversion models, e.g., pSp (Richardson et al., 2021), e4e (Tov et al., 2021), and Restyle (Alaluf et al., 2021a), together with the state-of-the-art high-rate inversion models, e.g., HyperStyle (Alaluf et al., 2021b) and HFGI (Wang et al., 2022a). First, we compared the quantitative performance of the inversion results. Next, we compared the qualitative results in the inversion and editing scenarios. For editing, we used InterFaceGAN (Shen et al., 2020) and StyleCLIP (Patashnik et al., 2021) to manipulate the latents. Finally, we analyzed the effectiveness of our wavelet loss and wavelet fusion. We conducted all experiments in the human face domain: we used the Flickr-Faces-HQ (FFHQ) dataset (Karras et al., 2019) for training and the CelebA-HQ dataset (Karras et al., 2017) for evaluation, with all images generated at the high resolution of 1024×1024. We trained our own e4e encoder, InterFaceGAN boundaries, and StyleCLIP for each attribute to exploit the latent space of SWAGAN (Gal et al., 2021).

4.1. QUANTITATIVE EVALUATION

We first evaluated our inversion quality against the existing baseline inversion methods. Table 1 shows the quantitative results of each method. We used all 3k images in the test split of the CelebA-HQ dataset and evaluated them with (i) the standard objectives, including L_2, LPIPS, SSIM, and ID similarity, and (ii) the wavelet loss (equation 6) to measure the spatial frequency distortion in the high-frequency sub-band. Our model consistently outperformed all of the baselines by a large margin. The results empirically show that our wavelet loss and wavelet fusion in the spatial frequency domain are capable of minimizing the spectral distortion and thus improving the reconstruction quality. We note, interestingly, that our model recorded the lowest LPIPS loss despite the explicit manipulation of the intermediate wavelet coefficients of the generator.

Figure 4: Qualitative comparison of inversion results. The baseline models, including the state-of-the-art high-rate inversion models, failed to robustly preserve details such as accessories and complex backgrounds. In contrast, the images inverted by WaGI showed robust reconstruction of image-wise details, e.g., legible letters in the second and third rows.

Table 1: Quantitative inversion results (Method; L_2 ↓; L_wave ↓; LPIPS ↓; SSIM ↑; ID sim ↑) for pSp, e4e, Restyle, HyperStyle, HFGI, and WaGI. WaGI consistently outperformed all of the baselines by a large margin across metrics related to both reconstruction quality and fidelity, indicating that our model learned the ground-truth frequency distribution most accurately, without loss of identity or perceptual quality.

4.2. QUALITATIVE EVALUATION

We show qualitative comparisons of inversion and editing results in Figure 4 and Figure 5, respectively. We observed that our model produced more realistic inversion results than the baselines, especially when images required the reconstruction of fine-grained details or complex backgrounds. For instance, HyperStyle (Alaluf et al., 2021b), which refines the weights of the generator per image, failed to reconstruct out-of-distribution objects, e.g., earrings and cameras. HFGI (Wang et al., 2022a) could generate most of the details lost in the initial inversion via feature fusion, but the restored details were close to artifacts, e.g., the letters in the background. In contrast, our method was solely capable of consistently reconstructing the details with minimum distortion. Figure 5 shows the editing results for seven attributes, manipulated via InterFaceGAN directions (Shen et al., 2020) and StyleCLIP (Patashnik et al., 2021). Our model consistently showed the most robust inversion results with high editability, while the baselines failed to edit images and restore details simultaneously. For instance, in the case of InterFaceGAN, the baselines struggled to preserve details, e.g., the hat in the second row, or showed low editability, e.g., HFGI in the third row.

Figure 5: Editing results. From the first to the fifth rows, we show images edited via InterFaceGAN directions, and from the sixth to seventh rows, editing results for StyleCLIP. Both low-rate and high-rate inversion baselines struggle to preserve details, e.g., letters, backgrounds, and hats. HFGI, which restores details relatively well among the baselines, fails to edit the image in a disentangled way, e.g., distortion of eye shapes when editing with "beard". Our proposed method efficiently restores high-fidelity details with satisfactory editability and highly disentangled editing performance throughout all scenarios.
In the case of StyleCLIP, similar to InterFaceGAN, the baselines failed to preserve details, e.g., the eyeglasses in the sixth row, or showed undesirable entanglement, e.g., the identity shift of HFGI in the sixth and seventh rows. Overall, both the InterFaceGAN and StyleCLIP editing results showed that our model best handles the trade-off between reconstruction quality and editability.

4.3. ABLATION STUDY

In this section, we analyzed the effectiveness of each component of WaGI, especially our wavelet loss and wavelet fusion. In Table 2, we quantitatively compared the performance while adding each proposed component to the state-of-the-art GAN inversion model, HFGI (Wang et al., 2022a). We compared L_1 through ∆̂ for the evaluation of ADA, and L_2 and SSIM through X̂ for the evaluation of the final inversion.

Figure 6: Qualitative ablation of WaGI (columns: HFGI, + SWAGAN, + wavelet loss, + wavelet fusion, ground truth). We compared the performance of ADA and the final inversion from the visualizations of ∆̂ and X̂, together with ∆ and X. While the existing state-of-the-art model introduced severe artifacts when computing ∆̂, our proposed methods reduced the artifacts effectively and showed better inversion quality.

Firstly, simply replacing the generator from StyleGAN2 (Karras et al., 2020) with SWAGAN (Gal et al., 2021) (Config B) made only marginal gains for both ADA and the final inversion. We attribute this gain to the characteristics of SWAGAN, which generates high frequencies more delicately than StyleGAN. In Config C, training with the wavelet loss achieved remarkably lower L_1(∆̂, ∆) (by about 19%) than training solely with L_1. It is compelling that training together with the wavelet loss achieved a lower L_1 than training solely with L_1. From this observation, we can argue that the wavelet loss is helpful not only for preserving the high-frequency sub-band, but also for reducing distortion on the low-frequency sub-band. Moreover, as shown in Figure 6, the wavelet loss indeed helps preserve details, e.g., nail colors, compared to Config B. Finally, when wavelet fusion is added to Config C (Config D), i.e., WaGI, there is an apparent gain for both ADA and the final inversion. Up to Config C, though we elaborately computed the high-frequency features, i.e., ∆̂, the model did not effectively transfer them to the generator. Wavelet fusion enables this information transfer to the generator, resulting in improved inversion. The gains are also qualitatively apparent in Figure 6.

Table 2: Ablation configurations evaluated with L_1(∆̂, ∆) ↓, L_2(X, X̂) ↓, and SSIM(X, X̂) ↑.

5. CONCLUSION

Recent high-rate GAN inversion methods focus on preserving image-wise details but still suffer from low-frequency bias. We point out that the existing methods are biased toward the low-frequency sub-band in both structural and training aspects. To overcome this, we proposed a novel GAN inversion model, named WaGI, which explicitly handles the wavelet coefficients of the high-frequency sub-band via wavelet loss and wavelet fusion. We demonstrated that WaGI achieves the best performance among the state-of-the-art GAN inversion methods. Moreover, we explored the effectiveness of each proposed method through an elaborate ablation study. Since our framework is simple and can be easily reproduced with our released code, we look forward to its wide usage in future work.

A APPENDIX

A.1 PROOFS FOR THE THEOREM

In this section, we give the proofs of the theorem and lemma proposed in the main paper.

Theorem 1. The following equation holds when λ_f = 1, ∀f ∈ F_l ∪ F_h:

L_2(I_1, I_2) = Σ_{f ∈ F_l ∪ F_h} λ_f L_2,f(I_1, I_2).

Proof. Let I_1 = (a_ij) ∈ R^{m×n} and I_2 = (b_ij) ∈ R^{m×n}, and, for simplicity of notation, define c_ij = a_ij − b_ij. Then

L_2(I_1, I_2) = (1/mn) Σ_{i,j} c_ij².

Assume m and n are even and let m' = m/2, n' = n/2. Filtering with the Haar filters of equation (1), scaled by 1/2 for orthonormality, with stride 2 yields the coefficients

d^LL_ij = (1/2)(c_2i+1,2j+1 + c_2i+1,2j + c_2i,2j+1 + c_2i,2j),
d^LH_ij = (1/2)(−c_2i+1,2j+1 − c_2i+1,2j + c_2i,2j+1 + c_2i,2j),
d^HL_ij = (1/2)(−c_2i+1,2j+1 + c_2i+1,2j − c_2i,2j+1 + c_2i,2j),
d^HH_ij = (1/2)(c_2i+1,2j+1 − c_2i+1,2j − c_2i,2j+1 + c_2i,2j),

and we define L_2,f(I_1, I_2) = (1/4m'n') Σ_{i,j} (d^f_ij)², normalizing by the original pixel count mn = 4m'n'. Grouping the pixels into 2×2 blocks, we can rewrite

L_2(I_1, I_2) = (1/4m'n') Σ_{i,j} (c²_2i+1,2j+1 + c²_2i+1,2j + c²_2i,2j+1 + c²_2i,2j).

We use the following identity, which holds for all x, y, z, w ∈ R:

(x+y+z+w)² + (−x−y+z+w)² + (−x+y−z+w)² + (x−y−z+w)² = 4(x² + y² + z² + w²).

With the 1/2 scaling, this gives Σ_f (d^f_ij)² = c²_2i+1,2j+1 + c²_2i+1,2j + c²_2i,2j+1 + c²_2i,2j, and therefore

Σ_{f ∈ F_l ∪ F_h} L_2,f(I_1, I_2) = (1/4m'n') Σ_{i,j} (c²_2i+1,2j+1 + c²_2i+1,2j + c²_2i,2j+1 + c²_2i,2j) = L_2(I_1, I_2). ∎

Lemma 1.1. When the pixel-wise differences between I_1 and I_2 are i.i.d. and follow N(µ, σ²) with µ ≈ 0, the following equation holds when λ_f = 1/4, ∀f ∈ F_l ∪ F_h:

log E[L_1(I_1, I_2)] + C = Σ_{f ∈ F_l ∪ F_h} λ_f log E[L_1,f(I_1, I_2)],

where C is a constant.

Proof.
As in the proof of Theorem 1, with the 1/2-scaled Haar filters and the same normalization we have

L_1(I_1, I_2) = (1/4m'n') Σ_{i,j} (|c_2i+1,2j+1| + |c_2i+1,2j| + |c_2i,2j+1| + |c_2i,2j|),
L_1,f(I_1, I_2) = (1/4m'n') Σ_{i,j} |d^f_ij|, f ∈ F_l ∪ F_h.

Using c_ij ~ N(µ, σ²) i.i.d., the filter responses are distributed as

d^LL_ij ~ N(2µ, σ²), d^LH_ij, d^HL_ij, d^HH_ij ~ N(0, σ²).

By the properties of the folded normal distribution, for p ~ N(µ, σ²),

E[|p|] = σ √(2/π) e^{−µ²/(2σ²)} + µ · erf(µ/√(2σ²)).

Since the c_ij are i.i.d. and µ ≈ 0, so that µ · erf(µ/√(2σ²)) ≈ 0, we obtain

E[L_1(I_1, I_2)] = σ √(2/π) e^{−µ²/(2σ²)},
E[L_1,LL(I_1, I_2)] = (1/4) σ √(2/π) e^{−4µ²/(2σ²)},
E[L_1,LH(I_1, I_2)] = E[L_1,HL(I_1, I_2)] = E[L_1,HH(I_1, I_2)] = (1/4) σ √(2/π).
Consequently,

log E[L_1(I_1, I_2)] = log(σ √(2/π)) − µ²/(2σ²),
log E[L_1,LL(I_1, I_2)] = log(1/4) + log(σ √(2/π)) − 4 · µ²/(2σ²),
log E[L_1,LH(I_1, I_2)] = log E[L_1,HL(I_1, I_2)] = log E[L_1,HH(I_1, I_2)] = log(1/4) + log(σ √(2/π)).

∴ log E[L_1(I_1, I_2)] + C = Σ_{f ∈ F_l ∪ F_h} (1/4) log E[L_1,f(I_1, I_2)], with C = −log 4 under this normalization. ∎

As with L_2, L_1 appears to be a fair loss without frequency bias, reflecting every L_1,f, f ∈ F_l ∪ F_h, with the same weight. However, as shown in Figure 7, we empirically find that L_1,LL is around 30 times larger than L_1,f for f ∈ F_h in the cases of HyperStyle and HFGI. This leads to biased training, resulting in an apparent decrease of L_1,LL but almost no gain, or even an increase, of L_1,f for f ∈ F_h, compared to Restyle. Consequently, we argue that L_1 contains the low-frequency bias and needs the wavelet loss to avoid it.

A.2.1 SPECTRAL LOSS

Previous works (Schwarz et al., 2021; Jiang et al., 2021; Durall et al., 2020) each propose an objective function to precisely learn the frequency distribution of the training data, which we comprehensively name spectral loss. Jiang et al. (2021) designed a spectral loss that measures the distance between fake and real images in the frequency domain, capturing both amplitude and phase information. Durall et al. (2020) proposed a spectral loss that measures the binary cross-entropy between the azimuthal integrations over the power spectra of fake and real images. Schwarz et al. (2021) used a simple L_2 loss between the logarithms of the azimuthal averages over the power spectra in normalized polar coordinates, i.e., the reduced spectra, of fake and real images. We adopted the spectral loss term of Schwarz et al. (2021) for our experiment:

L_S = (1/(H/√2)) Σ_{k=0}^{H/√2 − 1} ∥ log S̃(G(z))[k] − log S̃(I)[k] ∥²_2,

where S̃ is the reduced spectrum, G(z) is the generated image, and I is the ground-truth real image.
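For reference, a hedged NumPy sketch of the reduced-spectrum loss described above; the integer-radius histogram used to approximate the azimuthal average S̃ is our assumption, and the exact binning of Schwarz et al. (2021) may differ.

```python
import numpy as np

# Sketch: azimuthally average the power spectrum into radial bins, then take
# the mean squared difference of log spectra between two images.
def reduced_spectrum(img):
    h, w = img.shape
    psd = np.abs(np.fft.fftshift(np.fft.fft2(img))) ** 2
    yy, xx = np.indices((h, w))
    r = np.hypot(yy - h // 2, xx - w // 2).astype(int)  # radial frequency bin
    nbins = int(h / np.sqrt(2))
    total = np.bincount(r.ravel(), weights=psd.ravel(), minlength=nbins)
    count = np.bincount(r.ravel(), minlength=nbins)
    return total[:nbins] / np.maximum(count[:nbins], 1)

def spectral_loss(fake, real, eps=1e-8):
    sf, sr = reduced_spectrum(fake), reduced_spectrum(real)
    return float(np.mean((np.log(sf + eps) - np.log(sr + eps)) ** 2))
```

Note that this loss compares only radially averaged magnitudes: all spatial information is discarded, which is the limitation discussed in the experiment below.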
Here, we conducted a single-image reconstruction task, as widely done in prior work (Gal et al., 2021; Schwarz et al., 2021), to investigate whether explicit frequency matching helps refine high-fidelity details. For the StyleGAN2 (Karras et al., 2020) and SWAGAN (Gal et al., 2021) generators, we used the latent optimization method (Karras et al., 2019) to reconstruct a single image, each with and without the spectral loss. All images are generated at resolution 512 × 512, with the weight of the spectral loss set to 0.1 times that of the original L_2 loss. Figure 8 shows the reconstructed images and spectral density plots for each case. As seen in Figure 8 (a), the spectrum of a natural image follows an exponential decay. Using L_2 alone made both the StyleGAN2 and SWAGAN generators overfit to the dominant low-frequency distribution: (b) StyleGAN2 struggled to learn the high-fidelity details, creating an unrealistic image, while (d) SWAGAN was capable of fitting most of the high-frequency parts, except that it created some excessive high-frequency noise due to checkerboard patterns. Though adding the spectral loss to both generators (c, e) exquisitely matched the entire frequency distribution, the qualitative results degraded: coercively matching the spectrum induced unwanted artifacts into the images. Because it discards spatial information, a loss based on the spectral density inherently cannot robustly reconstruct high-frequency details. In contrast, our wavelet loss minimizes the L_1 distance of the high-frequency bands in the spatial-frequency domain, restoring meaningful high-fidelity features.
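A minimal sketch of such a wavelet loss is given below, assuming an orthonormal 2×2 Haar decomposition (the paper's exact filter normalization may differ):

```python
import numpy as np

def haar_bands(x):
    """Single-level Haar decomposition into (LL, LH, HL, HH) bands."""
    a, b = x[0::2, 0::2], x[0::2, 1::2]
    d, e = x[1::2, 0::2], x[1::2, 1::2]
    LL = (a + b + d + e) / 2.0
    LH = (a + b - d - e) / 2.0
    HL = (a - b + d - e) / 2.0
    HH = (a - b - d + e) / 2.0
    return LL, LH, HL, HH

def wavelet_loss(x, y):
    """L1 distance restricted to the high-frequency bands (LH, HL, HH)."""
    return float(sum(np.abs(bx - by).mean()
                     for bx, by in zip(haar_bands(x)[1:], haar_bands(y)[1:])))
```

Because the loss ignores the LL band, it is insensitive to purely low-frequency discrepancies (e.g., a constant brightness offset) while penalizing mismatched details.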

A.2.2 INFORMATION IN SUB-BAND OF IMAGES

In Section 3.2, we designed the multi-level wavelet loss to cover broader frequency ranges than f_nyq/2 ∼ f_nyq. In Figure 9, we show the results of the inverse wavelet transform obtained by omitting the wavelet coefficients of various sub-bands.

A.3.1 IMPLEMENTATION DETAILS

We implement our experiments based on the PyTorch code (foot_2) for SWAGAN (Gal et al., 2021). We converted the weights of the pre-trained SWAGAN generator checkpoint from the official TensorFlow code (foot_3) to a PyTorch version. We trained our model on a single GPU, and it took only 6 hours for the validation loss to saturate, whereas other StyleGAN2-based baselines required more than 2 days of training time. Here, we explain the details of our reconstruction loss terms: L_2, L_id, and L_LPIPS. We leverage L_2, as it is most effective at keeping the generated image pixel-wise similar to the original image. L_id is an identity loss function defined as

$$L_{id} = 1 - \langle R(G_0(w)), R(I) \rangle,$$

where $R$ is the pre-trained ArcFace (Deng et al., 2021) model and $I$ is the ground-truth image. L_id minimizes the cosine distance between two face images to preserve the identity. LPIPS (Zhang et al., 2018) enhances the perceptual quality of the image by minimizing the distance between the features of the generated image and those of the real image, extracted from an ImageNet (Deng et al., 2009) pre-trained network. For training, we used the weights λ_L2 = 1, λ_id = 0.1, and λ_LPIPS = 0.8, respectively.
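The combination of these terms can be sketched as follows; the feature vectors passed to the identity loss are stand-ins for ArcFace embeddings, which are not reproduced here:

```python
import numpy as np

def identity_loss(feat_gen, feat_real):
    """L_id = 1 - <R(G_0(w)), R(I)> with unit-normalized embeddings.
    The arguments stand in for ArcFace face embeddings."""
    f1 = feat_gen / np.linalg.norm(feat_gen)
    f2 = feat_real / np.linalg.norm(feat_real)
    return 1.0 - float(f1 @ f2)

def reconstruction_loss(l2, l_id, l_lpips,
                        lam_l2=1.0, lam_id=0.1, lam_lpips=0.8):
    """Weighted sum of the reconstruction terms with the training weights
    stated above (lambda_L2 = 1, lambda_id = 0.1, lambda_LPIPS = 0.8)."""
    return lam_l2 * l2 + lam_id * l_id + lam_lpips * l_lpips
```

The identity loss is 0 for identical embeddings and reaches 2 for exactly opposite ones, matching the cosine-distance interpretation.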

A.3.2 DATASET DESCRIPTION

Flickr-Faces-HQ (FFHQ) dataset (Karras et al., 2019). Our model and all baselines are trained on FFHQ, a well-aligned human face dataset with 70,000 images of resolution 1024 × 1024. The FFHQ dataset is widely used for training various unconditional generators (Karras et al., 2019; 2020; 2021) and GAN inversion models (Richardson et al., 2021; Tov et al., 2021; Alaluf et al., 2021a; b; Moon & Park, 2022; Wang et al., 2022a). All of the baselines we use in the paper are trained on the FFHQ dataset, which enables a fair comparison. CelebA-HQ dataset. The CelebA-HQ dataset contains 30,000 human facial images of resolution 1024 × 1024, together with segmentation masks. Among the 30,000 images, around 2,800 are designated as the test set. We use the official test split, and evaluate every baseline and our model on all images in the test set.

A.3.3 BASELINE MODELS DESCRIPTION

In this section, we describe the existing GAN inversion baselines which we used for comparison in Section 4. We exclude models that require image-wise optimization, such as Image2StyleGAN (Abdal et al., 2019) or Pivotal Tuning (Roich et al., 2021). pSp. pixel2Style2pixel (pSp) adopts a feature pyramid network (Lin et al., 2017) for encoder-based GAN inversion. pSp achieved state-of-the-art performance among encoder-based inversion models at the time of publication. Moreover, pSp demonstrates various adaptations of the encoder to StyleGAN-based tasks, such as image inpainting, face frontalization, and super-resolution. e4e. encoder4editing (e4e) identifies a trade-off between distortion and the perceptual quality-editability of image inversion. In other words, e4e proposes that existing GAN inversion models which focus on lowering distortion sacrifice the perceptual quality of inverted images and the robustness in the editing scenario. e4e suggests that keeping the latent close to the original StyleGAN latent space, i.e., W, enables the inverted image to have high perceptual quality and editability. To this end, e4e introduces additional training loss terms that keep the latent close to the W space. Though the distortion of pSp is lower than that of e4e, e4e shows apparently higher perceptual quality and editability than pSp. Restyle. Restyle suggests that a single feed-forward pass of existing encoder-based GAN inversion models, i.e., pSp and e4e, is not enough to utilize every detail in the image. To overcome this, Restyle proposes an iterative refinement scheme, which infers the latent with feed-forward-based iterative calculation. The lowest distortion that Restyle achieves among encoder-based GAN inversion models shows the effectiveness of the iterative refinement scheme.
Moreover, the iterative refinement scheme can be applied to both pSp and e4e, which enables constructing models that have strengths in lowering distortion or in perceptual quality-editability, respectively. To the best of our knowledge, Restyle_pSp achieves the lowest distortion among encoder-based models that do not use a generator-tuning method (foot_4). Since we utilize baselines that achieve lower distortion than Restyle_pSp, i.e., HyperStyle and HFGI, we only use Restyle_e4e, to evaluate its high editability. HyperStyle. To further lower distortion beyond Restyle, Pivotal Tuning (Roich et al., 2021) performs input-wise generator tuning. However, this is extremely time-consuming, and inconvenient in that it requires a separate generator for every input image. To overcome this, HyperStyle adopts a HyperNetwork (Ha et al., 2016), which enables tuning the convolutional weights of the pre-trained StyleGAN with only feed-forward calculation. Starting from the latent obtained by e4e, HyperStyle iteratively refines the generator to reconstruct the original image with the fixed latent. HyperStyle achieved the lowest distortion among encoder-based GAN inversion models at the time of publication. HFGI. HFGI points out the limitation of low-rate inversion methods and argues that encoders should adopt larger-dimensional tensors to transfer high-fidelity image-wise details. To achieve this, HFGI adopts feature fusion, which mixes the original StyleGAN feature with a feature obtained from the image-wise details. To the best of our knowledge, HFGI achieves the lowest distortion among all GAN inversion methods, even including Pivotal Tuning, except our model.

A.3.4 ABLATION STUDIES

Choice of fusion layer. We additionally provide both quantitative and qualitative ablation results for the inversion performance of WaGI with fusion at different layers. Note that in our main experiment, we apply feature fusion at layers ℓ_f = 7 and 9, and wavelet fusion at layer ℓ_w = 11. These layers correspond to the fusion of spatial features with resolutions 64 × 64 and 128 × 128, and of wavelet coefficients of dimension w ∈ R^{12×128×128}. From the quantitative results in Table 3, we observed that feature fusion on the two layers ℓ_f = 7 and 9 showed better reconstruction accuracy than on the single layer ℓ_f = 7. Additionally, wavelet fusion at lower layers (ℓ_w < 11) was not sufficient to preserve the high-fidelity details, especially in the high-frequency region, i.e., L_wave. Wavelet fusion at the higher layer (ℓ_w = 13) also degraded the inversion performance, which can be observed more closely in Figure 10. Figure 10 shows the inverted images for each scenario in Table 3. It is noticeable that fusion at a single layer (a)-(d) failed to retain high-frequency details like the hand and hair texture. Comparably, in the case of multi-layer feature fusion (e)-(h), the inverted images reconstructed more high-frequency details. Yet, wavelet fusion at the lower layers (e), (f), and the higher layer (h) generated unwanted distortions, which eventually degraded the image fidelity. Overall, our scenario (g) empirically showed the most promising reconstruction quality, generating realistic images with the least distortion. Design of fusion methods. To prove the effectiveness of the wavelet fusion, we compared the performance of WaGI with a model that uses the feature fusion proposed in HFGI (Wang et al., 2022a) instead of the wavelet fusion at the same resolution layer.
In Table 4 , we compared the performance of models with the following four settings: The original HFGI which uses feature fusion at l f = 7, HFGI with additional feature fusion at l f = 9 and 11, WaGI with the feature fusion at l f = 7, 9, and 11, and the original WaGI which uses the feature fusion at l f = 7 and 9, and the wavelet fusion at l w = 11. First, simply adding the feature fusion to the higher layer is not helpful for improving the model. If we change it to the WaGI method, i.e., change the generator and add the wavelet loss, the performance significantly improved. And after changing the feature fusion at the 



foot_1: We used e4e (Tov et al., 2021) for E_0.
foot_2: https://github.com/rosinality/stylegan2-pytorch
foot_3: https://github.com/rinongal/swagan
foot_4: IntereStyle (Moon & Park, 2022) achieves lower distortion on the interest region than Restyle_pSp, but not for the whole image region.



Figure 1: Preserving details at the image inversion. Comparison of inversion results for the noisy image. The zoomed parts are regions that require delicate preservation of details. The existing GAN inversion models, including recent high-rate inversion methods such as generator tuning, e.g., HyperStyle, and feature mixing, e.g., HFGI, still struggle to restore high-frequency details.

Figure 3: Training scheme of Wavelet-based GAN Inversion (WaGI). Given a pre-trained encoder E_0 and generator G_0, we can obtain an initial inverted image X_0. The residual ∆ contains high-fidelity details that X_0 misses. The model leverages a trainable ADA module to align the residual, which should ultimately be in alignment with X_0 or the edited image X_edit.

Figure 5: Qualitative comparison between editing results of baselines. The first to fifth rows show images edited via InterFaceGAN directions, and the sixth to seventh rows show editing results for StyleCLIP. Both low-rate and high-rate inversion baselines struggle to preserve details, e.g., letters, backgrounds, and hats. HFGI, which restores details relatively well among the baselines, fails to edit the image in a disentangled way, e.g., it distorts eye shapes when editing with "beard". Our proposed method efficiently restores high-fidelity details with satisfactory editability and highly disentangled editing performance throughout all scenarios.

Figure 7: Comparison of the L_1 of the wavelet coefficients. We plot the average L_1 of each wavelet coefficient between CelebA-HQ test images and the corresponding inverted images produced by various state-of-the-art inversion models. Due to the significant gap between L_{1,LL} and the rest (about 30 times), we display the losses on a logarithmic scale for better visualization.

Figure 8: Regression of a single image (top row) and spectral density plots of the ground-truth image and generated images (bottom row), trained with/without the additional spectral loss. Here, we used the spectral loss introduced in Schwarz et al. (2021). For both the StyleGAN2 and SWAGAN generators, the additional spectral loss induced artifacts by coercively matching the frequency distribution. We recommend you zoom in to carefully observe the reconstructed details.

Figure 9: Inverse wavelet transform results obtained by omitting various wavelet sub-bands. To check the qualitative image details in each sub-band, we remove the wavelet coefficients between f_nyq/2 ∼ f_nyq (Config A), f_nyq/2² ∼ f_nyq/2 (Config B), and f_nyq/2³ ∼ f_nyq/2² (Config C). From A to B, severe degradation of visible image details does not occur. However, from B to C or C to D, the majority of image details are degraded.

Figure 10: Qualitative comparison of WaGI inversion with fusion at different layers. Each image represents the inversion result for each scenario in Table 3. The first row (a)-(d) displays inverted images with feature fusion at the single layer ℓ_f = 7, with wavelet fusion at layer ℓ_w = 7, 9, 11, and 13, respectively. The second row (e)-(h) displays inverted images with feature fusion at the multiple layers ℓ_f = 7 and 9, with wavelet fusion at layer ℓ_w = 7, 9, 11, and 13, respectively. We recommend you zoom in for a careful look at the details.

Figure 12: Qualitative comparison between inversion and editing results of baselines. The first to third rows show the inverted images. The inverted image from WaGI preserves the most high-frequency details, like the texture of clothes and skin. The fourth to eighth rows show edited images of various attributes. WaGI is the only method capable of editing images without loss of high-frequency information.

Quantitative comparison between inversion results of baselines.


We remove the wavelet coefficients between f_nyq/2 ∼ f_nyq (Config A), f_nyq/2² ∼ f_nyq/2 (Config B), and f_nyq/2³ ∼ f_nyq/2² (Config C). Though Config A removes the highest-frequency sub-band among all configs, i.e., f_nyq/2 ∼ f_nyq, we cannot find visible degradation of image details. In other words, the information in the sub-band f_nyq/2 ∼ f_nyq mostly lies above the visible image details. Since the wavelet loss initially proposed in Equation 5 only covers the sub-band f_nyq/2 ∼ f_nyq, we should extend the range of sub-bands to effectively preserve the visible details. Consequently, we propose a K-level wavelet loss, which enables covering the sub-band f_nyq/2^{K+1} ∼ f_nyq.
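A minimal sketch of the K-level wavelet loss, assuming a recursive Haar decomposition of the LL band (normalization details may differ from our implementation):

```python
import numpy as np

def haar_step(x):
    """One Haar level: returns the LL band and the three high bands."""
    a, b = x[0::2, 0::2], x[0::2, 1::2]
    d, e = x[1::2, 0::2], x[1::2, 1::2]
    LL = (a + b + d + e) / 2.0
    highs = ((a + b - d - e) / 2.0,    # LH
             (a - b + d - e) / 2.0,    # HL
             (a - b - d + e) / 2.0)    # HH
    return LL, highs

def k_level_wavelet_loss(x, y, K=3):
    """Sum of L1 distances over the high-frequency bands of K Haar levels,
    covering progressively lower-frequency sub-bands as K grows."""
    loss = 0.0
    for _ in range(K):
        x, hx = haar_step(x)   # recurse into the LL band of each image
        y, hy = haar_step(y)
        loss += sum(np.abs(p - q).mean() for p, q in zip(hx, hy))
    return float(loss)
```

Each additional level halves the covered frequency band, so increasing K extends the loss toward the visible detail range discussed above.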

Table 3: Ablation of the fusion layers for WaGI. We compared the inversion performance of WaGI with feature and wavelet fusion at different layers. Feature fusion on layers ℓ_f = 7 and 9 and wavelet fusion on layer ℓ_w = 11 consistently showed the highest fidelity and reconstruction quality among all scenarios. (Columns: Feature fusion | Wavelet fusion | L_2 ↓ | L_wave ↓ | LPIPS ↓ | SSIM ↑ | ID sim ↑)

Table 4: Ablation of the fusion methods for WaGI. We compared the inversion performance of WaGI with a model that uses feature fusion instead of the wavelet fusion. Though replacing the wavelet fusion with feature fusion still achieves better results than HFGI, it shows a clear performance degradation compared to WaGI.

Model | L_2 ↓ | L_wave ↓ | SSIM ↑ | ID sim ↑
HFGI (ℓ_f = 7) | 0.023 | 0.351 | 0.661 | 0.864
HFGI (ℓ_f = 7, 9, 11) | 0.036 | 0.377 | 0.704 | 0.795
WaGI (ℓ_f = 7, 9, 11) | 0.017 | 0.302 | 0.699 | 0.873
WaGI (ℓ_f = 7, 9 and ℓ_w = 11) | 0.011 | 0.230 | 0.753 | 0.906

A.4 ADDITIONAL EXPERIMENTAL RESULTS

In this section, we show additional experimental results of WaGI. First, in Figure 11, we show visualization results of the intermediate features in the inversion scenario. We find that ∆ indeed reflects the image-wise details precisely, which results in a robust reconstruction in X. Even though the images in Figure 11 contain extreme out-of-distribution components, e.g., complex backgrounds, fingers, and big scars on the face, our model preserves every detail consistently. In Figure 12, we provide an extensive comparison between the baselines and our model in both inversion and editing scenarios. While the baselines omit or deform the image-wise details, our model robustly preserves them while maintaining high editability.

