GAN2GAN: GENERATIVE NOISE LEARNING FOR BLIND DENOISING WITH SINGLE NOISY IMAGES

Abstract

We tackle a challenging blind image denoising problem, in which only single distinct noisy images are available for training a denoiser, and nothing is known about the noise except that it is zero-mean, additive, and independent of the clean image. In such a setting, which often occurs in practice, it is not possible to train a denoiser with the standard discriminative training or with the recently developed Noise2Noise (N2N) training: the former requires the underlying clean image for each given noisy image, and the latter requires two independently realized noisy images per clean image. To that end, we propose the GAN2GAN (Generated-Artificial-Noise to Generated-Artificial-Noise) method, which first learns a generative model that can 1) simulate the noise in the given noisy images and 2) generate rough, noisy estimates of the clean images, and then 3) iteratively trains a denoiser with subsequently synthesized noisy image pairs (as in N2N) obtained from the generative model. Our results show that the denoiser trained with GAN2GAN achieves an impressive denoising performance on both synthetic and real-world datasets in the blind denoising setting; it almost matches the standard discriminatively-trained or N2N-trained models that have access to more information than ours, and it significantly outperforms a recent baseline for the same setting, Noise2Void, as well as the more conventional yet strong BM3D.

1. INTRODUCTION

Image denoising is one of the oldest problems in image processing and low-level computer vision, yet it still attracts much attention due to its fundamental nature. A vast number of algorithms have been proposed over the past several decades, and recently CNN-based methods, e.g., Cha & Moon (2019); Zhang et al. (2017); Tai et al. (2017); Liu et al. (2018), have become the throne-holders in terms of PSNR performance. The main approach of most CNN-based denoisers is to apply the discriminative learning framework with (clean, noisy) image pairs under a known noise distribution assumption. While effective, such a framework possesses a couple of limitations that become critical in practice: the assumed noise distribution may be mismatched to the actual noise in the data, and obtaining noise-free clean target images is not always possible or can be very expensive, e.g., in medical imaging (CT or MRI) or astrophotography. Several attempts have been made to resolve these issues. For the noise uncertainty, so-called blind training has been proposed: a denoiser is trained with a composite training set that contains images corrupted with multiple, pre-defined noise levels or distributions, and such blindly trained denoisers, e.g., DnCNN-B in Zhang et al. (2017), were shown to alleviate the mismatch scenarios to some extent. However, the second limitation, i.e., the requirement of clean images for building the training set, still remains. As an attempt to address this second limitation, Lehtinen et al. (2018) recently proposed the Noise2Noise (N2N) method, showing that a denoiser can be trained with negligible performance loss and without clean target images, as long as two independent noisy realizations of the same underlying clean image are available.
Despite its effectiveness, the requirement of two independently realized noisy images per clean image, which may hardly be available in practice, is a critical limiting factor for N2N. In this paper, we consider a setting in which neither of the above approaches is applicable, namely, the purely unsupervised blind denoising setting in which only single distinct noisy images are available for training. That is, nothing is known about the noise other than it being zero-mean, additive, and independent of the clean image, and neither clean target images for blind training nor noisy image pairs for N2N training are available. While some recent work, e.g., Krull et al. (2019); Batson & Royer (2019); Laine et al. (2019), took a self-supervised learning (SSL) approach for the same setting, we take a generative learning approach. The crux of our method is to first learn a Wasserstein GAN (Arjovsky et al., 2017)-based generative model that can 1) learn and simulate the noise in the given noisy images and 2) generate rough, initially denoised images. Using this generative model, we then synthesize noisy image pairs by corrupting each initially denoised image with the simulated noise twice and use them to train a CNN denoiser as in the N2N training (i.e., Noisy N2N). We further show that iteratively repeating the N2N training with refined denoised images can significantly improve the final denoising performance. We dub our method GAN2GAN (Generated-Artificial-Noise to Generated-Artificial-Noise) and show that the denoiser trained with it can achieve (and sometimes even outperform) the performance of the standard supervised-trained or N2N-trained blind denoisers for the white Gaussian noise case.
Furthermore, for mixture/correlated noise or real-world noise in microscopy/CT images, for which the exact distributions are hard to know a priori, we show our denoiser significantly outperforms those standard blind denoisers, which are mismatch-trained with white Gaussian noise, as well as other baselines that operate in the same condition as ours: the SSL baseline, N2V (Krull et al., 2019) , and a more conventional BM3D (Dabov et al., 2007) .

2. RELATED WORK

Several works have been proposed to overcome the limitations of vanilla supervised-learning-based denoising. As mentioned above, Noise2Self (N2S) (Batson & Royer, 2019) and Noise2Void (N2V) (Krull et al., 2019) recently applied the self-supervised learning (SSL) approach to train a denoiser only with single noisy images. Their settings exactly coincide with ours, but we show later that our GAN2GAN significantly outperforms them. More recently, Laine et al. (2019) improved N2V by incorporating specific noise likelihood models in a Bayesian framework; however, their method requires knowing the exact noise model and cannot be applied to more general, unknown noise settings. Similarly, Soltanayev & Chun (2018) proposed a SURE (Stein's Unbiased Risk Estimator)-based denoiser that can also be trained with single noisy images, but it works only for Gaussian noise. Their work was extended in Zhussip et al. (2019), but it requires noisy image pairs as in N2N, as well as the Gaussian noise constraint. Chen et al. (2018) devised the GCBD method, which learns and generates the noise in the given noisy images using W-GAN (Arjovsky et al., 2017) and utilizes unpaired clean images to build a supervised training set. Our GAN2GAN is related to Chen et al. (2018), but we significantly improve their noise learning step and do not use any clean data at all. Table 1 summarizes and compares the settings of the above recent baselines; we clearly see that only our GAN2GAN and N2V do not utilize any "sidekicks" that the other methods use. Table 1: Summary of different settings among the recent baselines.

Alg. \ Requirements                  Clean image   Noisy "pairs"   Noise model
N2N [Lehtinen et al. (2018)]              -              ✓              -
HQ SSL [Laine et al. (2019)]              -              -              ✓
SURE [Soltanayev & Chun (2018)]           -              -              ✓
Ext. SURE [Zhussip et al. (2019)]         -              ✓              ✓
GCBD [Chen et al. (2018)]                 ✓              -              -
N2V [Krull et al. (2019)]                 -              -              -
GAN2GAN (Ours)                            -              -              -

Additionally, several other recently published works address blind image denoising, but their settings also differ from ours.

3. MOTIVATION

In order to develop the core intuition motivating our method, we first consider a simple, single-letter Gaussian noise setting. Let $Z = X + N$ be the noisy observation of $X \sim \mathcal{N}(0, \sigma_X^2)$, corrupted by $N \sim \mathcal{N}(0, \sigma_N^2)$. It is well known that the minimum MSE (MMSE) estimator of $X$ given $Z$ is
$$f^*_{\text{MMSE}}(Z) = \mathbb{E}(X|Z) = \frac{\sigma_X^2}{\sigma_X^2 + \sigma_N^2} Z.$$
We now identify the optimality of N2N in this setting.

N2N  Assume that we have two i.i.d. copies of the noise $N$: $N_1$ and $N_2$. Then, let $Z_1 = X + N_1$ and $Z_2 = X + N_2$ be the two independent noisy observations of $X$. N2N in this setting corresponds to obtaining the MMSE estimator of $Z_2$ given $Z_1$:
$$f_{\text{N2N}}(Z_1) \triangleq \arg\min_f \mathbb{E}(Z_2 - f(Z_1))^2 = \mathbb{E}(Z_2|Z_1) = \mathbb{E}(X + N_2|Z_1) \stackrel{(a)}{=} \mathbb{E}(X|Z_1) = \frac{\sigma_X^2}{\sigma_X^2 + \sigma_N^2} Z_1, \quad (1)$$
in which (a) follows from $N_2$ being independent of $Z_1$. Note that (1) has the exact same form as $f^*_{\text{MMSE}}(Z)$; hence, estimating $X$ with $f_{\text{N2N}}(Z)$ also achieves the MMSE, in line with (Lehtinen et al., 2018).

"Noisy" N2N  Now, consider the case in which we again have the two i.i.d. $N_1$ and $N_2$, but the noisy observations are of a noisy version of $X$. Namely, let $X' = X + N_0$, in which $N_0 \sim \mathcal{N}(0, \sigma_0^2)$, and denote $Z'_1 = X' + N_1$ and $Z'_2 = X' + N_2$ as the noisy observation pairs. Then, we can define a "Noisy" N2N estimator as the MMSE estimator of $Z'_2$ given $Z'_1$:
$$f_{\text{Noisy N2N}}(Z'_1, y) \triangleq \arg\min_f \mathbb{E}(Z'_2 - f(Z'_1))^2 = \mathbb{E}(X'|Z'_1) = \frac{\sigma_X^2(1+y)}{\sigma_X^2(1+y) + \sigma_N^2} Z'_1, \quad (2)$$
in which we denote $y \triangleq \sigma_0^2/\sigma_X^2$ and assume $0 \le y < 1$. Clearly, (2) coincides with (1) when $y = \sigma_0^2 = 0$. Following N2N, (2) essentially estimates $X'$ based on $Z' = X' + N$. An interesting, subtle question is what happens when we use the mapping $f_{\text{Noisy N2N}}(Z, y)$ for estimating $X$ given $Z = X + N$, rather than $X'$ given $Z'$. Our theorem below, of which the proof is in the Supplementary Material (S.M.), shows that for a sufficiently large $\sigma_0^2$, $f_{\text{Noisy N2N}}(Z, y)$ gives a better estimate of $X$ than $X'$ does.
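The equivalence between the N2N solution (1) and the MMSE estimator can be checked numerically: fitting the best linear map from $Z_1$ to $Z_2$ by least squares (a scalar stand-in for the N2N regression; note that no clean $X$ is used in the fit) recovers the MMSE coefficient $\sigma_X^2/(\sigma_X^2+\sigma_N^2)$. A minimal Python sketch, with illustrative variances:

```python
import random

random.seed(0)

def best_linear_coeff(zs1, zs2):
    """Least-squares slope a minimizing E(Z2 - a*Z1)^2 (no intercept,
    valid here since everything is zero-mean)."""
    num = sum(z1 * z2 for z1, z2 in zip(zs1, zs2))
    den = sum(z1 * z1 for z1 in zs1)
    return num / den

sigma_x, sigma_n = 1.0, 0.5
n = 200_000
xs = [random.gauss(0, sigma_x) for _ in range(n)]
z1 = [x + random.gauss(0, sigma_n) for x in xs]
z2 = [x + random.gauss(0, sigma_n) for x in xs]

a_n2n = best_linear_coeff(z1, z2)                # fit Z2 from Z1; clean X never used
a_mmse = sigma_x**2 / (sigma_x**2 + sigma_n**2)  # closed-form MMSE coefficient = 0.8
print(a_n2n, a_mmse)  # the two coefficients agree up to Monte Carlo error
```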
Theorem 1  Consider the single-letter Gaussian setting and $f_{\text{Noisy N2N}}(Z, y)$ obtained in (2), and assume $0 < y < 1$. Then, there exists some $y_0$ such that for all $y \in (y_0, 1)$, $\mathbb{E}(X - f_{\text{Noisy N2N}}(Z, y))^2 < \sigma_0^2$.

Theorem 1 provides a simple but useful intuition that motivates our method: if simulating the noise in the images is possible, we may carry out the N2N training iteratively, provided that a rough noisy estimate of the clean image is initially available. Namely, we can first simulate the noise to generate noisy observation pairs of the initial noisy estimate, then do the Noisy N2N training with them to obtain a denoiser that may yield a better estimate of the clean image when applied to the actual noisy image subject to denoising (as in Theorem 1). We can then refine the estimates by iterating the Noisy N2N training with generated noisy observation pairs of the previous step's estimate of the clean image, until convergence.

To check whether the above intuition is valid, we carry out a feasibility experiment. Figure 1 shows the denoising results on BSD68 (Roth & Black, 2009) for Gaussian noise with $\sigma = 25$. The blue line is the PSNR of the N2N model trained with noisy observation pairs of the clean images in the BSD training set, serving as an upper bound. The orange line, in contrast, is the PSNR of the Noisy N2N$_1$ model trained with noisy observation pairs of the noisy estimates of the clean images, which were set to be another set of Gaussian noise-corrupted training images. The standard deviations ($\sigma_0$) of the Gaussian used for generating the noisy estimates are given on the horizontal axis, and the corresponding PSNRs of the estimates are given in parentheses. Although Noisy N2N$_1$ clearly lies much lower than the N2N upper bound, we note its PSNR is still higher than that of the initial noisy estimates, in line with Theorem 1.
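Since $f_{\text{Noisy N2N}}(Z, y) = cZ$ is linear, the risk $\mathbb{E}(X - f_{\text{Noisy N2N}}(Z, y))^2$ has the closed form $(1-c)^2\sigma_X^2 + c^2\sigma_N^2$, with $c$ the coefficient in (2). This lets us sanity-check Theorem 1's conclusion for particular, purely illustrative variances:

```python
def noisy_n2n_risk(sigma_x2, sigma_n2, y):
    """E(X - f_NoisyN2N(Z, y))^2 in the single-letter Gaussian setting:
    f(Z) = c*Z with c = sigma_x2*(1+y) / (sigma_x2*(1+y) + sigma_n2),
    and Z = X + N, so the risk is (1-c)^2*sigma_x2 + c^2*sigma_n2."""
    c = sigma_x2 * (1 + y) / (sigma_x2 * (1 + y) + sigma_n2)
    return (1 - c) ** 2 * sigma_x2 + c ** 2 * sigma_n2

sigma_x2, sigma_n2 = 1.0, 0.25   # illustrative choices
for y in (0.3, 0.6, 0.9):
    risk = noisy_n2n_risk(sigma_x2, sigma_n2, y)
    sigma_02 = y * sigma_x2      # MSE of the initial noisy estimate X' = X + N_0
    # risk < sigma_02 means denoising Z beats keeping the initial estimate
    print(y, risk, sigma_02, risk < sigma_02)
```

For these variances the inequality of Theorem 1 already holds well below $y = 1$; the theorem only guarantees it for $y$ close enough to 1.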
Now, if we iterate the Noisy N2N with the previous step's denoised images (i.e., Noisy-N2N$_2$/Noisy-N2N$_3$ for the second/third iterations, respectively), we observe that the PSNR significantly improves and approaches the ordinary N2N for most of the initial $\sigma_0$ values. Thus, the intuition from Theorem 1 generalizes well to the image denoising case in an ideal setting, where the noise can be perfectly simulated and the initial noisy estimates are Gaussian-corrupted versions. The remaining question is whether we can obtain similar results in the blind image denoising setting; we present our generative model-based approach in detail in the next section.

4. MAIN METHOD: THREE COMPONENTS OF GAN2GAN

To concretely describe our method, we first set the notation. We assume the noisy image $Z$ is generated by $Z = x + N$, in which $x$ denotes the underlying clean image and $N$ denotes the zero-mean, additive noise that is independent of $x$. For training a denoiser, we assume neither the distribution nor the covariance of $N$ is known. Moreover, we assume only a database of $n$ distinct noisy images, $D = \{Z^{(i)}\}_{i=1}^n$, is available for learning a denoiser. A CNN-based denoiser is denoted as $\hat{X}_\phi(Z)$ with $\phi$ being the model parameters, and we use the standard quality metrics, PSNR/SSIM, for evaluation. Our method consists of three parts: 1) smooth noisy patch extraction, 2) training a generative model, and 3) iterative GAN2GAN training of $\hat{X}_\phi(Z)$, each of which we elaborate on below.
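For reference, the main metric used throughout, PSNR, can be computed as in the following minimal sketch (images are flattened to pixel lists, and `max_val` is assumed to be the dynamic range of the data):

```python
import math

def psnr(clean, denoised, max_val=1.0):
    """Peak signal-to-noise ratio (in dB) between two equally sized images,
    given as flat lists of pixel values in [0, max_val]."""
    mse = sum((c - d) ** 2 for c, d in zip(clean, denoised)) / len(clean)
    if mse == 0:
        return float("inf")  # identical images
    return 10.0 * math.log10(max_val ** 2 / mse)

print(psnr([0.0, 0.5, 1.0], [0.1, 0.5, 0.9]))  # ~21.76 dB
```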

4.1. SMOOTH NOISY PATCH EXTRACTION

The first step is to extract noisy image patches from $D$ that correspond to smooth, homogeneous areas. Our extraction method is similar to that of GCBD (Chen et al., 2018), but we make a critical improvement. GCBD determines that a patch $p$ (of pre-determined size) is smooth if it satisfies the following for all of its smaller sub-patches $q_j$, with some hyperparameters $\mu, \gamma \in (0, 1)$:
$$|E(q_j) - E(p)| \le \mu E(p), \quad |V(q_j) - V(p)| \le \gamma V(p), \quad (3)$$
in which $E(\cdot)$ and $V(\cdot)$ are the empirical mean and variance of the pixel values; note that (3) has to be evaluated for all the sub-patches $\{q_j\}$. Our improved rule (4), described with Figure 2, instead tests the four sub-band decompositions of $p$ obtained by the 2D discrete wavelet transform (DWT). Once $N$ patches are extracted from $D$ using (4), we subtract from each patch its mean pixel value and obtain a set of "noise" patches, $\mathcal{N} = \{n^{(j)}\}_{j=1}^N$. Such subtraction is valid since all the pixel values should be close to their mean in a smooth patch, and the noise is assumed to be zero-mean and additive. Figure 2 compares rules (3) and (4) by showing the quality of the "noise" patches extracted from 1,000 Gaussian-corrupted images. The two plots in Figure 2(a) show the normalized histograms of the empirical standard deviations, $\hat\sigma$, of the extracted patches when the true $\sigma = \{25, 50\}$, respectively. We clearly observe that while the $\hat\sigma$'s for (4) are mostly concentrated on the true $\sigma$, those of (3) have much higher variation. In addition, Figure 2(b) visualizes randomly sampled patches whose $\hat\sigma$'s were above the 90-th percentile among the extracted patches for each rule (when $\sigma = 25$). Again, it is obvious that (3) may also select patches with high-frequency patterns, whereas (4) is much more effective for extracting accurate noise patches. Later, we show (in Figure 5) that the improved quality of the noise patches from our rule (4) plays an essential role; namely, our purely unsupervised denoiser using (4) even outperforms the clean-target-image-based denoiser of (Chen et al., 2018) using (3).
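Rule (3) can be sketched as follows. The sub-patch layout is not fully specified in the text, so the four quadrants are used here as an illustrative choice, and the $\mu, \gamma$ values are hypothetical:

```python
def stats(patch):
    """Empirical mean and (population) variance of a 2D patch."""
    pix = [v for row in patch for v in row]
    m = sum(pix) / len(pix)
    v = sum((p - m) ** 2 for p in pix) / len(pix)
    return m, v

def quadrants(patch):
    """Split a patch into its 4 quadrant sub-patches (illustrative choice)."""
    h, w = len(patch) // 2, len(patch[0]) // 2
    return [[row[c:c + w] for row in patch[r:r + h]]
            for r in (0, h) for c in (0, w)]

def is_smooth(patch, mu=0.1, gamma=0.25):
    """Rule (3): every sub-patch's mean/variance must stay within a relative
    margin of the whole patch's mean/variance. mu, gamma are hypothetical."""
    mp, vp = stats(patch)
    return all(abs(mq - mp) <= mu * mp and abs(vq - vp) <= gamma * vp
               for mq, vq in map(stats, quadrants(patch)))

flat = [[0.5] * 8 for _ in range(8)]      # homogeneous patch -> smooth
edge = [[0.1] * 8] * 4 + [[0.9] * 8] * 4  # strong edge -> rejected
print(is_smooth(flat), is_smooth(edge))
```

Note that this per-sub-patch test is exactly what rule (4) replaces with a single criterion over the DWT sub-bands, which the histogram comparison of Figure 2 shows to be far more selective.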

4.2. TRAINING A W-GAN BASED GENERATIVE MODEL

Equipped with $D = \{Z^{(i)}\}_{i=1}^n$ and the extracted noise patches $\mathcal{N} = \{n^{(j)}\}_{j=1}^N$, we train a generative model that can learn and simulate the noise as well as generate initial noisy estimates of the clean images, hence realizing the Noisy N2N training explained in Section 3. As shown in Figure 3, our model has three generators, $\{g_{\theta_1}, g_{\theta_2}, g_{\theta_3}\}$, and two critics, $\{f_{w_1}, f_{w_2}\}$, in which the subscripts stand for the model parameters. The loss functions associated with the components of our model are:
$$\mathcal{L}_n(\theta_1, w_1) \triangleq \mathbb{E}_n[f_{w_1}(n)] - \mathbb{E}_r[f_{w_1}(g_{\theta_1}(r))] \quad (5)$$
$$\mathcal{L}_Z(\theta_1, \theta_2, w_2) \triangleq \mathbb{E}_Z[f_{w_2}(Z)] - \mathbb{E}_{Z,r}[f_{w_2}(g_{\theta_2}(Z) + g_{\theta_1}(r))] \quad (6)$$
$$\mathcal{L}_{\text{cyc}}(\theta_2, \theta_3) \triangleq \mathbb{E}_Z\big\|Z - g_{\theta_3}(g_{\theta_2}(Z))\big\|_1. \quad (7)$$
The loss (5) is a standard W-GAN (Arjovsky et al., 2017) loss for training the first generator-critic pair, $(g_{\theta_1}, f_{w_1})$, of which $g_{\theta_1}$ learns to generate independent realizations of the noise mimicking the patches in $\mathcal{N} = \{n^{(j)}\}_{j=1}^N$, taking a random vector $r \sim \mathcal{N}(0, I)$ as input. The second loss (6) links the two generators, $g_{\theta_1}$ and $g_{\theta_2}$, with the second critic, $f_{w_2}$. The second generator $g_{\theta_2}$ is intended to generate an estimate of the underlying clean patch for $Z$, i.e., to coarsely denoise $Z$, and the critic $f_{w_2}$ determines how close the distribution of the generated noisy image, $g_{\theta_2}(Z) + g_{\theta_1}(r)$, is to that of $Z$.¹ Our intuition is that if $g_{\theta_1}$ can realistically simulate the noise, then enforcing $g_{\theta_2}(Z) + g_{\theta_1}(r)$ to mimic $Z$ results in learning a reasonable initial denoiser $g_{\theta_2}$. One important detail regarding $g_{\theta_2}$ is that its final activation must be the sigmoid function for stable training. The third loss (7), which resembles the cycle loss in (Zhu et al., 2017), imposes an encoder-decoder structure between $g_{\theta_2}$ and $g_{\theta_3}$, hence helping $g_{\theta_2}$ to compress the most redundant part of $Z$, i.e., the noise, and carry out the initial denoising.
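The three losses (5)–(7) can be illustrated with scalar stand-ins for images and one-dimensional callables in place of the networks (all maps below are hypothetical toy choices, and expectations are replaced by Monte Carlo averages):

```python
import random
random.seed(1)

# Toy stand-ins: "images" are scalars; generators/critics are scalar maps.
g1 = lambda r: 0.3 * r   # noise generator g_theta1
g2 = lambda z: 0.9 * z   # rough denoiser g_theta2
g3 = lambda x: x / 0.9   # decoder g_theta3 closing the cycle exactly
f1 = lambda n: 2.0 * n   # critic f_w1 on noise samples
f2 = lambda z: z - 0.5   # critic f_w2 on noisy images

noise = [random.gauss(0, 0.3) for _ in range(10_000)]  # extracted patches N
zs = [random.gauss(0.5, 1.0) for _ in range(10_000)]   # noisy images D
rs = [random.gauss(0, 1.0) for _ in range(10_000)]     # latent inputs r

mean = lambda xs: sum(xs) / len(xs)
# (5): W-GAN loss matching generated noise g1(r) to real noise patches
L_n = mean([f1(n) for n in noise]) - mean([f1(g1(r)) for r in rs])
# (6): critic compares real Z against the re-noised rough estimate g2(Z)+g1(r)
L_Z = mean([f2(z) for z in zs]) - mean([f2(g2(z) + g1(r)) for z, r in zip(zs, rs)])
# (7): cycle/L1 loss tying g2 and g3 into an encoder-decoder pair
L_cyc = mean([abs(z - g3(g2(z))) for z in zs])
print(L_n, L_Z, L_cyc)  # L_cyc is ~0 since g3 inverts g2 here by construction
```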
Once the losses are defined, the generators and critics are trained in an alternating manner, as in the training of W-GAN (Arjovsky et al., 2017), to approximately solve
$$\min_{\theta_1, \theta_2, \theta_3} \max_{w_1, w_2} \; \alpha\mathcal{L}_n(\theta_1, w_1) + \beta\mathcal{L}_Z(\theta_1, \theta_2, w_2) + \gamma\mathcal{L}_{\text{cyc}}(\theta_2, \theta_3). \quad (8)$$
The body of Algorithm 1 (steps 4–13) reads:
4:  Sample $\{n^{(i)}\}_{i=1}^m \sim \mathcal{N}$, $\{r^{(i)}\}_{i=1}^m \sim \mathcal{N}(0, I)$, $\{Z^{(i)}\}_{i=1}^m \sim D$
5:  for $ep_{\text{critic}} \leftarrow 1, n_{\text{critic}}$ do
6:    $g_{w_1} \leftarrow \nabla_{w_1}[\mathcal{L}_n(\theta_1, w_1)]$, $g_{w_2} \leftarrow \nabla_{w_2}[\mathcal{L}_Z(\theta_1, \theta_2, w_2)]$
7:    $w_1 \leftarrow \text{Clip}(w_1 + \alpha_{\text{critic}} \cdot \text{Adam}(w_1, g_{w_1}), -c, c)$
8:    $w_2 \leftarrow \text{Clip}(w_2 + \alpha_{\text{critic}} \cdot \text{Adam}(w_2, g_{w_2}), -c, c)$
9:  end for
10: $g_{\theta_1}, g_{\theta_2}, g_{\theta_3} \leftarrow \nabla_{\theta_1, \theta_2, \theta_3}[\alpha\mathcal{L}_n(\theta_1, w_1) + \beta\mathcal{L}_Z(\theta_1, \theta_2, w_2) + \gamma\mathcal{L}_{\text{cyc}}(\theta_2, \theta_3)]$
11: $\theta_1 \leftarrow \theta_1 - \alpha_g \cdot \text{Adam}(\theta_1, g_{\theta_1})$, $\theta_2 \leftarrow \theta_2 - \alpha_g \cdot \text{Adam}(\theta_2, g_{\theta_2})$, $\theta_3 \leftarrow \theta_3 - \alpha_g \cdot \text{Adam}(\theta_3, g_{\theta_3})$
12: end for
13: return $\theta_1, \theta_2$

4.3. ITERATIVE GAN2GAN TRAINING OF A DENOISER

With our generative model, we then carry out the iterative Noisy N2N training described in Section 3, with the generated noisy images: given each $Z^{(i)} \in D$, we generate a pair of synthetic noisy images as in (9).

For the real-noise experiments, we used two datasets: the WF set among the microscopy image datasets of (Zhang et al., 2019) and a reconstructed CT dataset. For both datasets, we trained/tested on each noise level separately, i.e., on each (Avg $= n$) set and each dose level, respectively. For the generative model training, the patch size used for $D$ and $\mathcal{N}$ was $96 \times 96$, and $n$ and $N$ were set to 20,000 (BSD) and 40,000 (microscopy), respectively. For the iterative G2G training, the patch size for $D$ was $120 \times 120$ with $n = 20{,}500$, and in every mini-batch we generated new noisy pairs with $g_{\theta_1}$, as in the noise augmentation of (Zhang et al., 2017). The architecture of G2G$_j$ was the 17-layer DnCNN of (Zhang et al., 2017). We put full details on training, model architectures, and hyperparameters, as well as the software platforms, in the S.M.

Baselines  The baselines were BM3D (Dabov et al., 2007), DnCNN-B (Zhang et al., 2017), N2N (Lehtinen et al., 2018), and N2V (Krull et al., 2019). We reproduced and trained DnCNN-B, N2N, and N2V using the publicly available source codes on the exact same training data as our iterative G2G training. For DnCNN-B and N2N, which use either clean targets or two independent noisy image copies, we used the 20-layer DnCNN model with composite additive white Gaussian noise with $\sigma \in [0, 55]$. N2V considers the same setting as ours and uses the exact same architecture as G2G$_j$; more details on N2V are also given in the S.M.
We could not compare with the scheme of (Laine et al., 2019), since their code cannot run beyond the white Gaussian noise case, nor in our setting in which the noisy images are fixed once given. Moreover, they have an unfair advantage: they generate new noisy images by corrupting the given clean images for every mini-batch, whereas we assume the given noisy images are fixed once and for all, and such noise augmentation is known to significantly increase performance. As an upper bound, we implemented N2C(Eq.(4)), denoting a 17-layer DnCNN trained with the clean target images in BSD400 and their noisy counterparts, corrupted by our $g_{\theta_1}$ learned with (4).

5.2. DENOISING RESULTS ON SYNTHETIC NOISE

White Gaussian noise  Table 2 shows the results on BSD68 corrupted by white Gaussian noise with different $\sigma$'s. Several variations of our G2G, $g_{\theta_2}$ and the G2G iterates G2G$_{j\ge1}$, are shown for two different training data versions for learning the generative model. Firstly, we clearly observe that the iterative G2G training steadily improves the denoising performance, particularly when the quality of the initial estimate is not good enough; this confirms that the result of Figure 1 indeed carries over to the blind denoising setting with our method. Secondly, we note G2G$_1$ already considerably outperforms N2V, which is trained with the exact same model architecture and dataset. Finally, the performance of G2G$_3$ is very strong: it outperforms BM3D, which knows the true $\sigma$, and even sometimes outperforms the blindly trained DnCNN-B and N2N, which are trained with the same BSD400 dataset but with more information. This somewhat counter-intuitive result is possible since our G2G$_j$ accurately learns the correct noise level in the image, while DnCNN-B and N2N are trained with composite noise levels, $\sigma \in [0, 55]$.

Mixture and correlated noise  Table 3 shows the results on mixture and correlated noise beyond white Gaussian. Note that our G2G$_j$ does not assume any distributional or correlation structure of the noise; hence, it can still run as long as the assumptions on the noise hold. In the table, the G2G results are for (BSD) as specified above; moreover, DnCNN-B and N2N are still blindly trained with the mismatched white Gaussian noise. For mixture noise, we tested two cases. Case A corresponds to the same setting as in (Chen et al., 2018): 70% $\sim \mathcal{N}(0, 0.1^2)$, 20% $\sim \mathcal{N}(0, 1)$, and 10% $\sim \text{Unif}[-s, s]$, i.e., uniformly distributed on $[-s, s]$, with $s = 15, 25$. For Case B, we tested with larger variances: 70% $\sim \mathcal{N}(0, 15^2)$, 20% $\sim \mathcal{N}(0, 25^2)$, and 10% $\sim \text{Unif}[-s, s]$ with $s = 30, 50$.
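A per-pixel sampler for the Case-B mixture noise can be sketched as follows (pixel intensities on the [0, 255] scale are assumed):

```python
import random
random.seed(2)

def mixture_noise_b(s=30.0):
    """One sample of the Case-B mixture noise:
    70% N(0, 15^2), 20% N(0, 25^2), 10% Uniform[-s, s]."""
    u = random.random()
    if u < 0.7:
        return random.gauss(0.0, 15.0)
    if u < 0.9:
        return random.gauss(0.0, 25.0)
    return random.uniform(-s, s)

samples = [mixture_noise_b() for _ in range(100_000)]
mean = sum(samples) / len(samples)
var = sum((x - mean) ** 2 for x in samples) / len(samples)
# Expected variance: 0.7*15^2 + 0.2*25^2 + 0.1*(60^2/12) = 157.5 + 125 + 30 = 312.5
print(mean, var)
```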
For correlated noise, we generated the following noise for each $\ell$-th pixel:
$$N_\ell = \eta M_\ell + (1 - \eta)\frac{1}{|N_B(\ell)|}\sum_{m \in N_B(\ell)} M_m, \quad \ell = 1, 2, \ldots$$
in which $\{M_\ell\}$ are white Gaussian $\mathcal{N}(0, \sigma^2)$, $N_B(\ell)$ is the $k \times k$ neighborhood patch around, but excluding, pixel $\ell$, and $\eta$ is a mixture parameter. We set $\eta = 1/\sqrt{2}$ such that the marginal distribution of $N_\ell$ is also $\mathcal{N}(0, \sigma^2)$, and set $k = 16$. Note that in this case $N$ has spatial correlation, and we tested with $\sigma = 15, 25$. From the table, we first note that DnCNN-B and N2N suffer serious performance degradation for both mixture and correlated noise due to the noise mismatch, and the conventional BM3D outperforms them in some cases (e.g., Case A for mixture noise). However, our G2G$_2$ can still denoise very well after just two iterations and outperforms all the baselines for all noise types, while N2V suffers seriously and is not comparable to ours. Finally, N2C(Eq.(4)) is a sound upper bound for all noise types, confirming the correctness of the extraction rule (4).
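The correlated noise above can be generated directly from its definition. The sketch below uses an approximate $k \times k$ window clipped at the image boundary (a simplification of the exact neighborhood definition):

```python
import random
random.seed(3)

def correlated_noise(h, w, sigma=25.0, k=16, eta=2 ** -0.5):
    """Spatially correlated noise: each pixel mixes its own white Gaussian
    sample with the average of a ~k x k neighborhood excluding itself."""
    m = [[random.gauss(0.0, sigma) for _ in range(w)] for _ in range(h)]
    half = k // 2
    out = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            nb = [m[a][b]
                  for a in range(max(0, i - half), min(h, i + half))
                  for b in range(max(0, j - half), min(w, j + half))
                  if (a, b) != (i, j)]
            out[i][j] = eta * m[i][j] + (1 - eta) * sum(nb) / len(nb)
    return out

field = correlated_noise(32, 32)
flat = [v for row in field for v in row]
print(sum(flat) / len(flat))  # near zero; neighboring pixels share noise mass
```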

5.3. DENOISING RESULTS ON REAL NOISE

We also test our method on real-world noise. While some popular real noise is known to have source-dependent characteristics, there are also cases in which the noise is source-independent and pixel-wise correlated, which satisfies the assumptions of our method. We tested on two such datasets: the Wide-Focal (WF) set in the microscopy image dataset (Zhang et al., 2019) and a Reconstructed CT dataset; more detailed descriptions and analyses of these two datasets are in the S.M. The WF and Reconstructed CT data have 5 sets (Avg = 1, 2, 4, 8, 16) and 4 sets (Dose = 25, 50, 75, 100) of noise levels, respectively. We did not exploit the fact that the images are multiple noisy measurements of a clean image (which would enable employing N2N) but treated them as noisy images of distinct clean images. Figures 4(a) and 4(b) show the PSNR of all methods for each dataset, respectively, averaged over all sets. The baselines were DnCNN-B, BM3D, and N2V; for BM3D, the noise $\sigma$ was estimated using the method of Chen et al. (2015). We iterated until G2G$_3$, and N2C(Eq.(4)) serves as an upper bound for each set. We clearly observe that the performance of G2G$_j$ significantly improves (over $g_{\theta_2}$) as the iterations continue. As a result, G2G$_3$ becomes significantly better on both datasets than DnCNN-B and N2V, as well as BM3D, which is still one of the strongest baselines for real-world noise denoising when no clean target images are available. We report more detailed experimental results (including SSIM) on both datasets in the S.M. Moreover, the inference time of BM3D is about 4.5∼5.0 seconds per image, since noise estimation has to be done for each image separately, whereas that of G2G$_j$ is only 4 ms (on GPU), another significant advantage of our method. Figure 4(c) shows visualizations on WF, and we give more examples in the S.M.

5.4. ABLATION STUDY

Noise patch extraction  Here, we evaluate the effect of the noisy patch extraction rules (3) and (4) on the final denoising performance. Figure 5 compares the PSNR of N2C(GCBD Eq.(3)), a re-implementation of (Chen et al., 2018), N2C(Ours Eq.(4)), and the best G2G, for each dataset. We note that neither the source code nor the training data of (Chen et al., 2018) is publicly available, and the PSNR reported in (Chen et al., 2018) could not be reproduced (even with the exact same $\mu$ and $\gamma$ as in (Chen et al., 2018)). From the figure, we clearly observe a significant gap between N2C(Ours Eq.(4)) and N2C(GCBD Eq.(3)), particularly when the noise is not white Gaussian. Moreover, our purely unsupervised G2G with (4) even outperforms N2C(GCBD Eq.(3)), which utilizes clean target images, confirming that the quality difference shown in Figure 2(b) significantly affects learning the noise and a denoiser.

Generative model and iterative G2G training  Figure 6(a) shows the PSNRs of $g_{\theta_2}$ on BSD68/Gaussian ($\sigma = 25$) trained with three variations: "No $\mathcal{L}_Z$" for no $f_{w_2}$, "No $\mathcal{L}_{\text{cyc}}$" for no (7) and $g_{\theta_3}$, and "No sigmoid" for no sigmoid activation at the output layer of $g_{\theta_2}$. We confirm that our proposed architecture achieves the highest PSNR for $g_{\theta_2}$: the sigmoid activation and $f_{w_2}$ are essential, and the cycle loss (7) is also important. Achieving a decent PSNR for $g_{\theta_2}$ is beneficial for reducing the number of G2G iterations and achieving a high final PSNR. More detailed analyses of the generative model architecture are in the S.M. Figures 6(b) and 6(c) examine the initial estimate for the iterative G2G training. From Figure 1, one may ask whether $g_{\theta_2}$ is indeed necessary, since even when $\sigma_0 \approx \sigma$, iterating the Noisy N2N can mostly achieve the upper bound. Hence, for samples of synthetic and real microscopy data, we evaluate how G2G$_j$ performs when the iteration simply starts from $Z$.
Figure 6(b) shows a somewhat surprising result: for synthetic noise, starting from $Z$ achieves essentially the same performance as starting from $g_{\theta_2}$, with a couple more G2G iterations. However, for the real microscopy noise case, WF (Avg = 1) in Figure 6(c), starting from $Z$ achieves far lower performance than starting from $g_{\theta_2}$, justifying our generative model for obtaining the initial noisy estimate.



¹We assume the generator implicitly includes a cropping step for $Z$ such that the dimensions of $g_{\theta_2}(Z)$ and $g_{\theta_1}(r)$ match.

6. CONCLUDING REMARK

Motivated by a novel observation on Noisy N2N, we proposed the novel GAN2GAN method, which can tackle the challenging blind image denoising problem solely with single noisy images. Our method showed impressive denoising performance that sometimes even outperforms methods with more information, as well as VST+BM3D for real noise denoising. As future work, we plan to extend our framework to more explicitly handle source-dependent real-world noise.



Figure 1: Iterative Noisy N2N.

Figure 2: Comparison of smooth noisy patch extraction rules. While (3) works for extracting smooth patches to some extent, as we show in Figure 2(b), it does not rule out choosing patches with high-frequency repeating patterns, which are far from smooth. Thus, we instead use the 2D discrete wavelet transform (DWT) for a new extraction rule; namely, we determine $p$ is smooth if its four sub-band decompositions obtained by DWT, $\{W_k(p)\}_{k=1}^4$, satisfy the averaged sub-band condition (4), $\frac{1}{4}\sum_{k=1}^{4}\cdots$

Figure 3: Overall structure of the W-GAN based generative model.

in which $(\alpha, \beta, \gamma)$ are hyperparameters controlling the trade-offs between the loss functions. The pseudocode for training the generative model is given in Algorithm 1. There are a couple of subtle points in training with the overall objective (8), and we describe the full details on model architectures and hyperparameters in the S.M.

Algorithm 1  Training a generative model. All experiments in this paper used the default values $n_{\text{critic}} = 5$, $n_{\text{epoch}} = 30$, $m = 64$, $\alpha_g = 4 \times 10^{-4}$, $\alpha_{\text{critic}} = 5 \times 10^{-5}$, $\alpha = 5$, $\beta = 1$, $\gamma = 10$.
1: Require $D$, $\lambda$
2: $\mathcal{N} \leftarrow \text{NoisePatchExtraction}(D, \lambda)$
3: for $ep_{\text{GAN}} \leftarrow 1, n_{\text{epoch}}$ do

Figure 4: Results on real microscopy image denoising on WF and medical image denoising.

Figure 5: Effect of noise patch extraction rule.

Figure 6: Ablation studies. (b) and (c) compare the performances between starting from g θ2 and Z.

Table 2: Results on BSD68/Gaussian. Boldface denotes algorithms that use only single noisy images. Red and blue denote the highest and second-highest results among those algorithms, respectively.

Table 3: Results on BSD68/Mixture & correlated noise. The boldface and colored texts are as before.


ACKNOWLEDGMENT This work was supported in part by NRF Mid-Career Research Program [NRF-2021R1A2C2007884] and IITP grant [No.2019-0-01396, Development of framework for analyzing, detecting, mitigating of bias in AI model and training data], funded by the Korean government (MSIT).


Given each $Z^{(i)} \in D$, we generate the pair
$$(\hat{Z}^{(i)}_{11}, \hat{Z}^{(i)}_{12}) \triangleq \big(g_{\theta_2}(Z^{(i)}) + g_{\theta_1}(r^{(i)}_{11}),\; g_{\theta_2}(Z^{(i)}) + g_{\theta_1}(r^{(i)}_{12})\big), \quad (9)$$
in which $r^{(i)}_{11}, r^{(i)}_{12} \in \mathbb{R}^{128}$ are i.i.d. $\sim \mathcal{N}(0, I)$. In contrast to the ideal case in Section 3, each generated image in (9) is a noise-corrupted version of $g_{\theta_2}(Z^{(i)})$, in which the corruption is done by the simulated noise $g_{\theta_1}(r)$. Denoting the set of such pairs as $\hat{D}_1 = \{(\hat{Z}^{(i)}_{11}, \hat{Z}^{(i)}_{12})\}_{i=1}^n$, we train a denoiser by minimizing the N2N-style loss $\mathcal{L}_{\text{G2G}}(\phi, \hat{D}_1)$ on these pairs. In $\mathcal{L}_{\text{G2G}}(\cdot)$, we use only the generated noisy images and never the actually observed $Z^{(i)}$; hence, we dub our training GAN2GAN (G2G) training. Now, denoting the learned denoiser as G2G$_1$ (with parameters $\phi_1$), we can iterate the G2G training. For the $j$-th iteration (with $j \ge 2$), we generate
$$(\hat{Z}^{(i)}_{j1}, \hat{Z}^{(i)}_{j2}) \triangleq \big(\text{G2G}_{j-1}(Z^{(i)}) + g_{\theta_1}(r^{(i)}_{j1}),\; \text{G2G}_{j-1}(Z^{(i)}) + g_{\theta_1}(r^{(i)}_{j2})\big) \quad (10)$$
for each $Z^{(i)}$ and denote the resulting set of pairs as $\hat{D}_j$. Note that in (10) we update the noisy estimate of the clean image with the output of G2G$_{j-1}$. Then, the new denoiser G2G$_j$ is obtained by computing $\phi_j \triangleq \arg\min_\phi \mathcal{L}_{\text{G2G}}(\phi, \hat{D}_j)$, where the minimization is warm-started from $\phi_{j-1}$. In our experiments, we show that the sequence G2G$_{j\ge1}$ successively refines the denoising quality and significantly improves on the initial noisy estimate, similarly as in Figure 1. Moreover, we identify that the benefit of the iterative G2G training becomes greater when the noise is more sophisticated: for synthetic noise, the performance of G2G$_{j\ge1}$ converges after 1∼3 iterations, whereas for the real-world microscopy noise, the performance keeps increasing over a larger number of iterations.
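The iterative procedure of (9)–(10) can be illustrated in the scalar setting of Section 3, replacing the networks with hypothetical linear maps: the "denoiser" trained at each round is the least-squares linear fit on the freshly generated pairs, and its output seeds the next round, exactly as in (10).

```python
import random
random.seed(4)

# Scalar stand-ins for images: z = x + noise. The learned generators are
# mimicked by simple closed-form maps (hypothetical, for illustration only).
sigma = 0.5
g_theta1 = lambda: random.gauss(0.0, sigma)  # simulated noise draw, as in (9)
g_theta2 = lambda z: 0.6 * z                 # rough initial denoiser

def train_n2n(pairs):
    """'Train' a linear denoiser a*z on generated pairs (z1, z2) by least
    squares -- the scalar analogue of minimizing the G2G loss."""
    num = sum(z1 * z2 for z1, z2 in pairs)
    den = sum(z1 * z1 for z1, _ in pairs)
    return lambda z, a=num / den: a * z

xs = [random.gauss(0.0, 1.0) for _ in range(50_000)]  # clean (hidden) values
zs = [x + g_theta1() for x in xs]                     # observed noisy values

denoiser = g_theta2                   # iteration 0: rough estimate from g_theta2
for _ in range(3):                    # G2G_1, G2G_2, G2G_3
    pairs = [(denoiser(z) + g_theta1(), denoiser(z) + g_theta1()) for z in zs]
    denoiser = train_n2n(pairs)       # Noisy N2N on the generated pairs only

mse = sum((x - denoiser(z)) ** 2 for x, z in zip(xs, zs)) / len(xs)
print(mse)  # improves on the initial rough denoiser; the MMSE here is 0.2
```

Under these toy choices, the iterates improve markedly on the initial rough denoiser (whose MSE is 0.25) and move toward, without exactly reaching, the MMSE of 0.2, mirroring the behavior observed in Figure 1.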

5.1. DATA AND EXPERIMENTAL SETTINGS

Data & training details  In the synthetic noise experiments, we always used noisy training images from BSD400 (Martin et al., 2001). For evaluation, we used the standard BSD68 (Roth & Black, 2009) as a test set.

