AN UNSUPERVISED DEEP LEARNING APPROACH FOR REAL-WORLD IMAGE DENOISING

Abstract

Designing an unsupervised image denoising approach for practical applications is a challenging task due to the complicated data acquisition process. In the real-world case, the noise distribution is so complex that the simplified additive white Gaussian noise (AWGN) assumption rarely holds, which significantly degrades the performance of Gaussian denoisers. To address this problem, we apply a deep neural network that maps the noisy image into a latent space in which the AWGN assumption holds, so that any existing Gaussian denoiser becomes applicable. More specifically, the proposed neural network consists of an encoder-decoder structure and approximates the likelihood term in the Bayesian framework. Together with a Gaussian denoiser, the neural network can be trained on the input image itself and does not require any pre-training on other datasets. Extensive experiments on real-world noisy image datasets show that the combination of neural networks and Gaussian denoisers improves the performance of the original Gaussian denoisers by a large margin. In particular, the neural network + BM3D method significantly outperforms other unsupervised denoising approaches and is competitive with supervised networks such as DnCNN, FFDNet, and CBDNet.

1. INTRODUCTION

Noise always exists during the process of image acquisition, and removing it is important for image recovery and downstream vision tasks, e.g., segmentation and recognition. Specifically, the noisy image y is modeled as y = x + n, where x denotes the clean image and n denotes the corrupting noise; image denoising aims at recovering x from y. Over the past two decades, this problem has been extensively explored and many methods have been proposed. Among them, one typical kind of model assumes that the image is corrupted by additive white Gaussian noise (AWGN), i.e., n ∼ N(0, σ²I), where I denotes the identity matrix. Representative Gaussian denoising approaches include block matching and 3D filtering (BM3D) (Dabov et al., 2007b), the non-local means method (NLM) (Buades et al., 2005), K-SVD (Aharon et al., 2006) and weighted nuclear norm minimization (WNNM) (Gu et al., 2014), which perform well on AWGN removal. However, the AWGN assumption seldom holds in practical applications, as noise accumulates throughout the imaging pipeline. For example, in typical CCD or CMOS cameras, the noise depends on the underlying context (daytime or nighttime, static or dynamic, indoor or outdoor, etc.) and the camera settings (shutter speed, ISO, white balance, etc.). In Figure 1, two real noisy images captured by Samsung Galaxy S6 Edge and Google Pixel smartphones are chosen from the Smartphone Image Denoising Dataset (SIDD) (Abdelhamed et al., 2018), and three 40 × 40 patches are selected to illustrate the noise distribution. It is clear that the real noise distribution is content dependent, and the noise in each patch has different statistical properties, which can be non-Gaussian. Due to the violation of the AWGN assumption, the performance of Gaussian denoisers deteriorates significantly (Figure 1(d)). Thus, it is crucial to characterize the noise distribution and adapt the noise model to the denoiser in real-world image denoising.
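The content dependence described above can be illustrated with a small simulation. The snippet below is a hypothetical illustration (not taken from the paper): shot (Poisson) noise has a variance that tracks the local intensity, while AWGN has the same variance in every patch; all intensity values and patch sizes are arbitrary choices for the demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical illustration: shot noise in a camera sensor is
# signal-dependent, so its variance differs between image patches,
# whereas AWGN has the same variance everywhere.
dark = np.full((40, 40), 10.0)      # dark patch, mean intensity 10
bright = np.full((40, 40), 200.0)   # bright patch, mean intensity 200

# Poisson (shot) noise: variance tracks the local intensity.
poisson_var_dark = (rng.poisson(dark) - dark).var()
poisson_var_bright = (rng.poisson(bright) - bright).var()

# AWGN with sigma = 5: variance is constant across patches.
awgn_var_dark = rng.normal(0.0, 5.0, dark.shape).var()
awgn_var_bright = rng.normal(0.0, 5.0, bright.shape).var()

print(poisson_var_dark, poisson_var_bright)   # roughly 10 vs 200
print(awgn_var_dark, awgn_var_bright)         # both roughly 25
```

A Gaussian denoiser calibrated for a single σ cannot fit both patches of the Poisson case, which is exactly the mismatch visible in Figure 1(c).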
In recent years, deep learning based methods have achieved remarkable performance with careful architecture design, good training strategies, and a large number of noisy/clean image pairs. However, these approaches have two main drawbacks from the perspective of practical applications. One is the strong dependency on the quality and size of the training dataset. Collecting such image pairs is time-consuming and requires substantial human effort, especially when the labeling needs deep domain knowledge, as for medical or seismic images. Very recent deep learning methods including Noise2Noise (N2N) (Lehtinen et al., 2018), Noise2Void (N2V) (Krull et al., 2019a) and Noise2Self (N2S) (Batson & Royer, 2019) have relaxed the dataset requirement and can be trained on organized/unorganized noisy-noisy image pairs. Nevertheless, to guarantee performance, these networks need to be pre-trained on a large number of images to cover sufficiently many local patterns, and thus they are not cost-effective. Therefore, to reduce the dependency on the training set, single-image based denoising approaches deserve to be studied and have both practical and scientific value. It is worth mentioning that a recent unsupervised learning work (Ulyanov et al., 2018) applies a deep image prior to general image recovery problems, but its denoising results are inferior to some typical Gaussian denoisers, e.g., BM3D. The other drawback is the limited generalization ability of a trained network. When the noise distribution is complicated and not covered by the training set, the results of deep learning methods can deteriorate significantly, becoming even worse than those of non-learning based methods. To alleviate this problem, some recent works incorporate noise estimation into the network design, e.g., Guo et al. (2019); Yue et al. (2019); Zhang et al. (2017).
Despite their good performance in blind Gaussian denoising (Guo et al., 2019; Zhang et al., 2017) and real-world denoising (Yue et al., 2019), these methods require a large number of noisy/clean image pairs, and the generalization problem remains when the imaging system is complicated. Very recently, a single-image based method was proposed in Quan et al. (2020) by developing a novel dropout technique for image denoising. In summary, unsupervised deep learning approaches with accurate noise models are important for solving real-world image denoising problems, yet current solutions are unsatisfactory. Such an approach deserves to be studied and is challenging, as it requires a careful combination of traditional and deep learning based methods so that the benefits of both are fully exploited.

1.1. THE SUMMARY OF IDEAS AND CONTRIBUTIONS

Motivated by the above analysis, the goal of this paper is to propose an unsupervised deep learning method that boosts the performance of existing Gaussian denoisers on real-world image denoising problems. The basic idea is to find a latent image z associated with the input noisy image y such that z|x satisfies the AWGN assumption, so that we can recover the clean image x from z with any existing Gaussian denoiser. To find the appropriate latent representation, we propose a neural network (NN) based approach that builds the mapping between the noisy image y and the latent image z with an encoder-decoder structure. By applying the Gaussian denoiser in the latent space, we alternately update the NN and the denoised image, which requires no other training samples. Figure 2 illustrates the workflow of the proposed approach. This idea can be formulated under the classical maximum a posteriori (MAP) framework, which consists of a likelihood term and a prior term. Building a proper likelihood term requires an accurate estimation of the noise distribution. Although the exact noise distribution is difficult to obtain, an evidence lower bound (ELBO) can be analytically derived that approximates the likelihood from below using the variational auto-encoder (VAE) (Kingma & Welling, 2013). This ELBO gives the loss function for the encoder and decoder networks that map between the noisy image and the latent image. From the above derivation, we arrive at the model

min_{x, F, G} f(x, y, F, G) + R(x),    (1)

where x is the clean image, y is the input noisy image, F and G are the decoder and encoder maps parameterized by NNs, f is the loss derived from the ELBO, and R(x) is the regularization term. Model (1) can be minimized by the alternating direction method of multipliers (ADMM), which alternately updates the networks and the clean image estimate x. Using the Plug-and-Play technique (Venkatakrishnan et al., 2013), the x-update can be replaced by any Gaussian denoiser.
Thus, by fully exploiting the benefits of deep neural networks and classic denoising schemes, real-world image denoising can be improved by a large margin, as shown in Figure 1(e). More importantly, training the proposed networks uses only the noisy image itself and needs no pre-training. In summary, our main contributions are as follows.

• We propose an effective approach that combines deep learning with traditional methods for unsupervised image denoising. Thanks to the great expressive power of deep neural networks, the complex noise distribution is mapped into a latent space in which the AWGN assumption tends to hold, and thus better results are obtained by applying existing Gaussian denoisers to the latent images.

• Instead of a heuristic loss design, the proposed NN approximates the likelihood in the classic Bayesian framework, which gives a clear interpretation of each loss term. Meanwhile, compared to many existing deep learning methods, our model is trained on a single image only, which significantly reduces the burden of data collection.

• Extensive numerical experiments on real-world noisy image datasets show that the NN boosts the performance of existing denoisers including NLM, BM3D and DnCNN. In particular, the results of NN+BM3D are competitive with some supervised deep learning approaches such as DnCNN+, FFDNet+ and CBDNet.

2. RELATED WORKS

There are numerous works on image denoising. Here, we review non-learning and learning based approaches for real-world image denoising.

2.1. NON-LEARNING BASED APPROACHES

Non-learning approaches are mainly based on the MAP framework, which contains a data fidelity term and a regularization term. Many works have been proposed to improve the regularization term, e.g., sparsity based methods (Rudin et al., 1992; Perona & Malik, 1990), low rank priors (Dong et al., 2012b; Gu et al., 2014) and non-local methods (Buades et al., 2005; Dabov et al., 2007a). Among these methods, BM3D (Dabov et al., 2007a) is one of the best performing. Several other works address the construction of the data fidelity term by modeling the complex noise distribution, e.g., Lebrun et al. (2015a); Nam et al. (2016); Xu et al. (2017); Zhu et al. (2016). The correlated Gaussian distribution (Lebrun et al., 2015a) and mixtures of Gaussians (Zhu et al., 2016; Nam et al., 2016) have been used to approximate the unknown noise distribution. In Xu et al. (2017), different noise statistics are estimated in different channels, without accounting for content dependent noise. Recently, Amini et al. (2020) proposed a Gaussianization method for gray-scale OCT images. However, this method is not applicable to real-world image denoising tasks, as natural images are colorful and their noise distribution is much more complicated than that of OCT images. Overall, due to the complexity of real-world noise, the performance of these approaches is unsatisfactory and needs to be improved.

2.2. LEARNING BASED APPROACHES

The learning based approaches can be classified into two groups: single-image based methods and dataset based methods. Typical single-image based approaches are sparse coding methods (Aharon et al., 2006; Bao et al., 2015; Xu et al., 2018b). In Xu et al. (2018b), the noise in each channel is estimated and followed by a weighted sparse coding scheme. In recent years, with the appearance of real image denoising datasets including CC (Nam et al., 2016), PolyU (Xu et al., 2018a), DND (Plotz & Roth, 2017) and SIDD (Abdelhamed et al., 2018), deep neural networks including Guo et al. (2019); Yu et al. (2019); Zhang et al. (2017); Zhou et al. (2019); Yue et al. (2019) have shown promising results on these datasets. However, these networks require many noisy/clean training pairs, which limits their practical applications, especially when the labeling work needs domain experts. More recently, deep learning approaches (Krull et al., 2019a;b; Batson & Royer, 2019; Laine et al., 2019; Lehtinen et al., 2018) have been proposed that are trained with organized or unorganized noisy image pairs. To guarantee satisfactory performance, these methods still need many training pairs so that sufficiently many local patterns are covered. Compared to the above deep learning approaches, our method is single-image based and needs neither training samples nor pre-training on other datasets.

3. OUR METHODOLOGY

This section starts with the derivation of our model and is followed by the detailed optimization techniques that incorporate any existing Gaussian denoiser.

3.1. THE MODEL FORMULATION

Let y ∈ R^N be the noisy image, where N = Height × Width × 3. The classic MAP framework maximizes the posterior p(x|y), formulated as the optimization problem

max_x ln p(x|y) ∝ max_x { ln p(y|x) + ln p(x) } = max_x { ln p(y|x) − λR(x) }.    (2)

The term p(x) is the prior, which represents the internal statistics of natural images. One common choice is p(x) ∝ exp(−λR(x)), where R(x) is a regularization function. The term p(y|x) is the likelihood that models the uncertainty of the observed image, i.e., y − x ∼ p(n), where p(n) denotes the noise distribution. In practice, the noise distribution is complex and only an approximation can be found. Following the VAE approach (Kingma & Welling, 2013), the likelihood p(y|x) has a lower bound:

ln p(y|x) = ln ∫ p(y, z|x) dz = ln ∫ q(z|y) [p(y, z|x) / q(z|y)] dz
          ≥ ∫ q(z|y) ln [p(y, z|x) / q(z|y)] dz
          = E_{q(z|y)}[ln p(y|z, x)] − KL(q(z|y) || p(z|x)) := ELBO,

where z is the latent image, q is the distribution of the latent image z conditioned on the noisy image y, KL denotes the Kullback-Leibler (KL) divergence, ELBO stands for the evidence lower bound, and the inequality follows from Jensen's inequality. In practice, we can construct a tractable ELBO with high expressive power, which motivates the use of NNs. The proposed NN consists of an encoder net G and a decoder net F that construct the mappings between the noisy image y and the latent image z. Suppose the latent image z is a Gaussian corruption of the clean image x with strength σ, i.e., z|x ∼ N(x, σ²I). By choosing p(y|z, x) and q(z|y) properly, the next proposition gives a closed form of the ELBO.

Proposition 1 Suppose z|x ∼ N(x, σ²I). Choosing y|z, x ∼ N(F(z), I) and q(z|y) = N(G(y), I), where F and G are the decoder and encoder respectively, the ELBO equals

−(1/2) E_ε ||F(G(y) + ε) − y||² − (1/(2σ²)) ||G(y) − x||² + c,    (3)

where ε ∼ N(0, I) and c is a constant depending on the image size. The proof is given in Appendix A.1.
Let F and G be parameterized by θ1 and θ2 respectively. Replacing the likelihood in (2) by (3) yields our model:

E(x, θ1, θ2) = (1/2) E_ε ||F_{θ1}(G_{θ2}(y) + ε) − y||² + (1/(2σ²)) ||G_{θ2}(y) − x||² + λR(x).    (4)

The expectation term in (4) can be estimated by the Monte Carlo method. Following the standard sampling approach in VAE (Kingma & Welling, 2013), we resample ε before every backpropagation step in our experiments.

Remark 1 In our model, the basic assumption is that the latent image z is a white Gaussian perturbation of the clean image x, i.e., z|x ∼ N(x, σ²I). Since z = G_{θ2}(y), the second term in (4) can be seen as a transformed data fidelity. Moreover, the combination of the three terms prevents the encoder from degenerating into the zero mapping or the identity mapping.
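The first two terms of (4), with the Monte Carlo resampling of ε, can be sketched in a few lines of PyTorch. This is a minimal sketch: single 3 × 3 convolutions stand in for the encoder G and decoder F (the paper uses 10-layer U-Nets), and the regularization term λR(x) is omitted because it is handled by the Gaussian denoiser in the full algorithm.

```python
import torch

torch.manual_seed(0)
G = torch.nn.Conv2d(3, 3, 3, padding=1)   # encoder stand-in for a U-Net
F = torch.nn.Conv2d(3, 3, 3, padding=1)   # decoder stand-in for a U-Net

def loss_fn(y, x, sigma=5.0):
    z = G(y)
    eps = torch.randn_like(z)             # epsilon resampled before every backward pass
    recon = ((F(z + eps) - y) ** 2).sum() / 2            # reconstruction term of (4)
    fidelity = ((z - x) ** 2).sum() / (2 * sigma ** 2)   # transformed data fidelity
    return recon + fidelity               # lambda*R(x) is applied via the denoiser

y = torch.rand(1, 3, 16, 16)              # toy noisy input
loss = loss_fn(y, y.clone())              # x is initialized to y in the algorithm
loss.backward()                           # gradients flow to both networks
```

Resampling ε inside `loss_fn` implements the single-sample Monte Carlo estimate of the expectation in (4).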

3.2. MODEL OPTIMIZATION

Since the loss function (4) is non-convex and the regularization R(x) can be complicated, we introduce an auxiliary variable p and apply the ADMM scheme to solve (4). Concretely, define

f(x, θ1, θ2) = (1/2) E_ε ||F_{θ1}(G_{θ2}(y) + ε) − y||² + (1/(2σ²)) ||G_{θ2}(y) − x||²,

so that the loss function (4) is rewritten as

min_{x, θ1, θ2, p} f(x, θ1, θ2) + R(p),  s.t.  p − x = 0.    (5)

The augmented Lagrangian of (5) is

L_ρ(x, θ1, θ2, p) = f(x, θ1, θ2) + R(p) + (ρ/2) ||x − p + q/ρ||² − (ρ/2) ||q/ρ||²,

where q is the dual variable and ρ > 0 is a chosen constant. ADMM consists of the following iterates:

(x^{k+1}, θ1^{k+1}, θ2^{k+1}) = argmin_{x, θ1, θ2} L_ρ(x, θ1, θ2, p^k, q^k),    (6)
p^{k+1} = argmin_p L_ρ(x^{k+1}, θ1^{k+1}, θ2^{k+1}, p, q^k),    (7)
q^{k+1} = q^k + ρ(x^{k+1} − p^{k+1}).

Subproblem (6) is solved by alternating minimization, i.e., the clean image estimate x and the network parameters θ1, θ2 are updated alternately. More specifically, the networks F and G are updated by backpropagation with x fixed. Then, with θ1 and θ2 fixed, x is updated by solving

min_x (1/(2σ²)) ||x − G_{θ2}(y)||² + (ρ/2) ||x − p + q/ρ||².    (8)

The least squares problem (8) has a closed-form solution. Besides, subproblem (7) is equivalent to

min_p R(p) + (ρ/2) ||p − x^{k+1} − q^k/ρ||²,

so p^{k+1} = Prox_{R,ρ}(x^{k+1} + q^k/ρ), where the proximal mapping is defined as Prox_{R,ρ}(z) = argmin_p R(p) + (ρ/2) ||z − p||². Motivated by this observation, the Plug-and-Play method (Venkatakrishnan et al., 2013) generalizes this proximal mapping to any existing denoising scheme, i.e., p^{k+1} = T(x^{k+1} + q^k/ρ). In our method, we choose T to be any existing Gaussian denoiser, e.g., NLM (Buades et al., 2005) or BM3D (Dabov et al., 2007a). The detailed algorithm is given in Algorithm 1, where X denotes the Gaussian denoiser.

Remark 2 The network parameters θ1 and θ2 are initialized randomly, without using any pretrained model.
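The closed-form solution of (8) can be checked numerically. The sketch below (with arbitrary random vectors standing in for G_{θ2}(y), p and q) sets the quadratic objective's gradient to zero and verifies that perturbing the resulting point never decreases the objective.

```python
import numpy as np

rng = np.random.default_rng(1)
sigma, rho = 5.0, 1.0
# g stands in for G(y); p and q are the ADMM auxiliary and dual variables.
g, p, q = rng.normal(size=100), rng.normal(size=100), rng.normal(size=100)

# The objective in (8) is quadratic in x, so setting its gradient
# (x - g)/sigma^2 + rho*(x - p + q/rho) to zero gives the minimizer:
x_star = (g / sigma**2 + rho * p - q) / (1 / sigma**2 + rho)

def objective(x):
    return ((x - g)**2).sum() / (2 * sigma**2) \
         + rho / 2 * ((x - p + q / rho)**2).sum()

# Sanity check: perturbing x_star in random directions never helps.
for _ in range(5):
    d = rng.normal(size=100) * 1e-2
    assert objective(x_star + d) >= objective(x_star)
```

This is exactly the x-update used inside the inner loop of Algorithm 1.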
In our method, p is initialized to y and q is initialized to 0, so that x is a linear combination of y and the latent image at the early stage of the algorithm. Therefore, x stays close to y at the beginning, which leads to correct convergence of our algorithm.

Algorithm 1 The Denoising Algorithm NN+X.
Input: Noisy image y, ρ, σ, η;
Output: Denoised image x;
Initialize x^0 = y, p^0 = y, q^0 = 0, and network parameters θ1, θ2.
for k = 0, 1, 2, ..., M do
    for i = 0, 1, 2, ..., m do
        Sample ε from the standard Gaussian distribution in (6).
        Update θ1^{k,i+1}, θ2^{k,i+1} by applying the backpropagation (BP) algorithm to (6).
        Update x^{k,i+1} = ( (1/σ²) G_{θ2^{k,i+1}}(y) + ρ p^k − q^k ) / ( ρ + 1/σ² ).
    end for
    Set x^{k+1} = x^{k,m}, θ1^{k+1} = θ1^{k,m}, θ2^{k+1} = θ2^{k,m}.
    Update p^{k+1} by applying the Gaussian denoiser X to x^{k+1} + q^k/ρ.
    Update q^{k+1} = q^k + ηρ(x^{k+1} − p^{k+1}).
end for
return x = x^M.
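Algorithm 1 can be sketched end-to-end in PyTorch. This is a toy sketch, not the paper's implementation: single convolutions stand in for the U-Net encoder/decoder, a box blur stands in for the Gaussian denoiser X (BM3D/NLM in the paper), and the image and iteration counts are kept tiny so the loop runs in seconds.

```python
import torch

torch.manual_seed(0)
y = torch.rand(1, 3, 32, 32)                     # noisy input image
G = torch.nn.Conv2d(3, 3, 3, padding=1)          # encoder stand-in
F = torch.nn.Conv2d(3, 3, 3, padding=1)          # decoder stand-in
opt = torch.optim.Adam(list(G.parameters()) + list(F.parameters()), lr=0.01)

def denoiser_X(im):                              # box-blur stand-in for BM3D/NLM/DnCNN
    return torch.nn.functional.avg_pool2d(im, 3, stride=1, padding=1)

rho, sigma, eta = 1.0, 5.0, 0.5                  # values used in the paper
x, p, q = y.clone(), y.clone(), torch.zeros_like(y)
for k in range(3):                               # outer ADMM iterations (M)
    for i in range(10):                          # inner network updates (m)
        opt.zero_grad()
        z = G(y)
        eps = torch.randn_like(z)                # resample epsilon each step
        loss = ((F(z + eps) - y)**2).sum() / 2 \
             + ((z - x)**2).sum() / (2 * sigma**2)
        loss.backward()
        opt.step()
        with torch.no_grad():                    # closed-form x-update from (8)
            x = (G(y) / sigma**2 + rho * p - q) / (rho + 1 / sigma**2)
    p = denoiser_X(x + q / rho).detach()         # Plug-and-Play p-update
    q = q + eta * rho * (x - p)                  # dual update
```

The structure mirrors Algorithm 1 line by line; swapping `denoiser_X` for a real BM3D call would recover the NN+BM3D variant.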

4. EXPERIMENTS

We evaluate the performance of our method on real noisy images in this section. All experiments are evaluated in the sRGB space.

4.1. IMPLEMENTATION DETAILS

Both the encoder network G and the decoder network F are standard 10-layer U-Nets (Ronneberger et al., 2015), implemented in PyTorch and run on Nvidia 1080Ti or Nvidia 2080Ti GPUs. The ADAM algorithm (Kingma & Ba, 2014) is adopted to optimize the network parameters, with the learning rate set to 0.01. The number of epochs in network training is set to 500, and the parameters ρ, σ and η in ADMM are set to 1, 5 and 0.5 respectively. The noise level is estimated following Donoho & Johnstone (1994); Chen et al. (2015a). For an image of size 512 × 512 × 3, our method needs about 15 minutes on a single Nvidia 2080Ti GPU.
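The Donoho & Johnstone (1994) noise-level estimate can be sketched as follows. The classical form is σ̂ = median(|d|)/0.6745, where d are the finest-scale diagonal wavelet coefficients; the Haar HH subband used here is one common concrete choice, shown on pure synthetic noise for illustration.

```python
import numpy as np

# Robust median-absolute-deviation noise estimate (Donoho & Johnstone, 1994):
# for AWGN, the orthonormal Haar HH coefficients are themselves N(0, sigma^2),
# and median(|N(0, sigma^2)|) = 0.6745 * sigma.
def estimate_sigma(img):
    a = img[0::2, 0::2]; b = img[0::2, 1::2]
    c = img[1::2, 0::2]; d = img[1::2, 1::2]
    hh = (a - b - c + d) / 2.0        # orthonormal 2x2 Haar HH subband
    return np.median(np.abs(hh)) / 0.6745

rng = np.random.default_rng(0)
noise = rng.normal(0.0, 7.0, size=(256, 256))
print(round(estimate_sigma(noise), 1))   # close to the true sigma = 7
```

Using the median rather than the sample standard deviation makes the estimate robust to the image's own edges, which behave like outliers in the HH subband.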

4.2. EXPERIMENTS ON REAL-WORLD NOISE

We combine the NN with three existing Gaussian denoisers, including two traditional methods (non-local means (NLM) (Buades et al., 2005) and block matching and 3D filtering (BM3D) (Dabov et al., 2007b)) and one pre-trained deep learning method (DnCNN) (Zhang et al., 2017) from its official project website. We choose two natural real-world noisy image datasets, CC (Nam et al., 2016) and PolyU (Xu et al., 2018a), and one real fluorescence microscopy dataset, FMDD (Zhang et al., 2019), for testing the performance of our method in terms of PSNR and SSIM. Besides, the Variance Stabilizing Transform (VST) method (Makitalo & Foi, 2012), a traditional noise transformation method, is chosen for comparison. We employ the noise estimation method of Foi et al. (2008) to estimate the Poisson-Gaussian noise parameters for the VST method on FMDD. On CC and PolyU, we evaluate different Poisson noise parameters (peak values ranging from 10 to 10000) and select the best for the whole dataset. There are 15 and 100 images in the CC (Nam et al., 2016) and PolyU (Xu et al., 2018a) datasets respectively, with the same cropped regions as in their original papers. On FMDD, we evaluate the performance on the mixed test set with raw images, the same setting as in Zhang et al. (2019).

Denoising results. The denoising results are evaluated using PSNR and SSIM, with the built-in functions of the Python skimage package. The results are listed in Table 1, from which we make the following observations: (a) Our model improves the performance of all three Gaussian denoisers by a large margin on all datasets. On average, the PSNR/SSIM values are increased by 2.35/0.0588, 2.87/0.0844 and 1.173/0.0331 for NLM, BM3D and DnCNN respectively; see Figure 3 for a visual example from CC. (b) Compared with the VST method, the existing Gaussian denoisers benefit more from the proposed method.

5. DISCUSSION

Validation of the AWGN assumption for latent images. Let x, y, z be the clean, noisy and latent images, and denote by n1 = y − x and n2 = z − x the noise in image space and latent space respectively. We visualize the distributions of n1 and n2 in Figure 4 using the two images in Figure 1. The noise distribution in latent space is clearly more similar to white Gaussian noise than that in image space. Moreover, define a_i = Mean(n_i) and σ_i² = Var(n_i) for i = 1, 2, where Mean and Var are the mean and variance operators. To quantify the distance between n_i, i = 1, 2, and a white Gaussian distribution, we calculate the KL divergence between n_i and N(a_i, σ_i²I) on the CC dataset. In image space, the KL divergence is 0.4701, while in latent space it reduces to 0.3821, 0.2868 and 0.3636 for NN+NLM, NN+BM3D and NN+DnCNN respectively. Thus the noise distribution in latent space is closer to a Gaussian distribution than that in image space.

Comparison with supervised/unsupervised methods. Three blind image denoising networks, DnCNN+ (Zhang et al., 2017), FFDNet+ (Zhang et al., 2018) and CBDNet (Guo et al., 2019), and three unsupervised approaches, Noise2Noise (N2N) (Lehtinen et al., 2018), Deep Image Prior (DIP) (Ulyanov et al., 2018) (3000 iterations, averaged over 5 runs) and Noise-As-Clean (NAC) (Xu et al., 2019), are chosen for comparison on the CC dataset. FFDNet+ is a multi-scale extension of FFDNet (Zhang et al., 2018), and DnCNN+ is a color version of DnCNN fine-tuned with the FFDNet+ results. The denoising results are listed in Table 2. NN+BM3D outperforms both the deep neural networks trained on datasets and the other unsupervised deep learning methods.

The capacity of the decoder. The proposed model (4) has a trivial global minimum if (a) the decoder is a constant mapping, i.e., F(z) = y for all z, and (b) G(y) = x = 0.
We test the capacity of the decoder used in our model by fitting the map between a constant image and the noisy image; the training loss versus iteration is reported in Figure 5(a). The average training loss on the PolyU dataset is 294.56, which shows that our decoder is not expressive enough to realize the trivial minimum; see further explanation in Appendix A.2.

Training stability. Four classic images, Kodim03 (red), Kodim02 (green), Lena (blue) and Peppers (yellow), with noise level σ = 30, are used for testing the training stability of our method. The PSNR value versus iteration number is reported in Figure 5(b); the PSNR keeps increasing and stabilizes after a certain number of iterations.

Latent space evolution. One image from the SIDD dataset shown in the introduction is used to illustrate the latent image evolution in our method. In particular, we run NN+BM3D for 20 iterations and show 8 latent images and the evolution of their patch noise distributions in Figure 6. The noise distribution gradually changes from non-Gaussian to Gaussian.

Hyperparameter sensitivity. The sensitivity to the hyperparameters is shown in Figure 5(c), (d) and (e), where we vary one of ρ, σ and η at a time and fix the others at ρ = 1, σ = 5 and η = 0.5. Our method is not sensitive to the choice of hyperparameters.
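The KL-based Gaussianity check used at the start of this section can be sketched as follows. This is a simplified sketch, assuming the residual is summarized by a histogram and compared against a moment-matched Gaussian via a discretized KL divergence; the bin count and the synthetic residuals are arbitrary choices for the demonstration.

```python
import numpy as np

# Discretized KL divergence between an empirical residual n and the
# moment-matched Gaussian N(mean(n), var(n)).
def kl_to_gaussian(n, bins=100):
    hist, edges = np.histogram(n, bins=bins, density=True)
    centers = (edges[:-1] + edges[1:]) / 2
    width = edges[1] - edges[0]
    mu, var = n.mean(), n.var()
    gauss = np.exp(-(centers - mu)**2 / (2 * var)) / np.sqrt(2 * np.pi * var)
    mask = (hist > 0) & (gauss > 0)
    return np.sum(hist[mask] * np.log(hist[mask] / gauss[mask])) * width

rng = np.random.default_rng(0)
gaussian_noise = rng.normal(0, 1, 100_000)           # AWGN-like residual
skewed_noise = rng.exponential(1, 100_000) - 1.0     # non-Gaussian residual

# The Gaussian residual scores much closer to zero than the skewed one.
print(kl_to_gaussian(gaussian_noise), kl_to_gaussian(skewed_noise))
```

A residual that truly satisfies the AWGN assumption drives this score toward zero, which is the behavior reported for the latent space above.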

6. CONCLUSION

In this paper, we propose an NN based method that maps the complex noise distribution of real-world images into a latent space in which the AWGN assumption holds. Combined with any existing Gaussian denoising approach, it improves the denoising results by a large margin. More importantly, this method does not require any training sample except the input noisy image itself. Extensive results validate the rationale of the proposed network training scheme and show the advantages of our method compared with existing representative approaches, including both learning and non-learning based methods.

A PROOF OF PROPOSITION 1

We show the derivation of Proposition 1 and give a theoretical explanation of the capacity of the decoder.

A.1 PROPOSITION 1

The ELBO in Proposition 1 is defined as E_{q(z|y)}[ln p(y|z, x)] − KL(q(z|y) || p(z|x)), with z|x ∼ N(x, σ²I), y|z, x ∼ N(F(z), I) and q(z|y) = N(G(y), I). Assume the size of x is m × n × 1 for a gray-scale image. The KL divergence of the two Gaussian distributions is

KL(q(z|y) || p(z|x)) = E_{q(z|y)}[ln q(z|y) − ln p(z|x)]
  = (mn/2) ln σ² + (1/2) E_{q(z|y)}[ −||z − G(y)||² + (1/σ²)||z − x||² ]
  = (mn/2) ln σ² − mn/2 + (1/(2σ²)) E_{q(z|y)} ||z − x||²
  = (mn/2) ln σ² − mn/2 + (1/(2σ²)) E_{q(z|y)} ||z − G(y) + G(y) − x||²
  = (mn/2) ln σ² − mn/2 + (1/(2σ²)) ( mn + ||G(y) − x||² )
  = (mn/2) ( ln σ² − 1 + 1/σ² ) + (1/(2σ²)) ||G(y) − x||².
Using the reparameterization in Kingma & Welling (2013), z|y = G(y) + ε with ε ∼ N(0, I), the expectation term is

E_{q(z|y)}[ln p(y|z, x)] = −(mn/2) ln 2π − (1/2) E_{q(z|y)} ||F(z) − y||²
                         = −(mn/2) ln 2π − (1/2) E_ε ||F(G(y) + ε) − y||².

Then the ELBO = E_{q(z|y)}[ln p(y|z, x)] − KL(q(z|y) || p(z|x)) equals

−(1/2) E_ε ||F(G(y) + ε) − y||² − (1/(2σ²)) ||G(y) − x||² − (mn/2) ( ln 2π + ln σ² − 1 + 1/σ² ),

and for RGB images the constant term becomes −(3mn/2) ( ln 2π + ln σ² − 1 + 1/σ² ). Thus Proposition 1 holds.
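The closed-form KL term in the proof above can be verified numerically. The sketch below, using an arbitrary small dimension d in place of mn, compares the closed form KL(q||p) = (d/2)(ln σ² − 1 + 1/σ²) + ||μ_q − μ_p||²/(2σ²) against a Monte Carlo estimate of E_q[ln q(z) − ln p(z)].

```python
import numpy as np

rng = np.random.default_rng(0)
d, sigma = 4, 2.0                       # toy dimension and latent noise level
mu_q = rng.normal(size=d)               # stands in for G(y)
mu_p = rng.normal(size=d)               # stands in for x

# Closed form derived in the proof (with d playing the role of mn).
closed = d / 2 * (np.log(sigma**2) - 1 + 1 / sigma**2) \
       + np.sum((mu_q - mu_p)**2) / (2 * sigma**2)

# Monte Carlo estimate of E_q[ln q(z) - ln p(z)] for q = N(mu_q, I),
# p = N(mu_p, sigma^2 I); the shared -(d/2) ln 2*pi terms cancel.
z = mu_q + rng.normal(size=(200_000, d))
ln_q = -0.5 * np.sum((z - mu_q)**2, axis=1)
ln_p = -d * np.log(sigma) - np.sum((z - mu_p)**2, axis=1) / (2 * sigma**2)
mc = np.mean(ln_q - ln_p)

print(closed, mc)  # the Monte Carlo estimate matches the closed form
```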

A.2 THE CAPACITY OF DECODER

Here we explain the capacity of the decoder via the following proposition.

Proposition 2 Suppose x ∈ R^{m×n×c} is a constant image, i.e., x_{i,j,k} = x_{i',j',k} for all i, i' ∈ {1, ..., m}, j, j' ∈ {1, ..., n}, k ∈ {1, ..., c}, and F is a deep neural network composed of convolution layers (with reflect/replicate padding), down-sampling layers (with max/average pooling), and up-sampling layers (with bilinear/nearest-point interpolation). Then F(x) is constant.

To simplify notation, denote convolution layers by C, down-sampling layers by D, and up-sampling layers by U; we only need to show that for a constant input x ∈ R^{m×n×c}, each of C(x), D(x) and U(x) is constant. Let x̂ = C(x) ∈ R^{m'×n'×c'}; we need to show x̂_{i,j,k} = x̂_{i',j',k}. From the dimension of x̂, there are c' convolution kernels, denoted K^k ∈ R^{p_k × q_k × c}, k = 1, ..., c'. Then

x̂_{i,j,k} = Σ_{l1,l2,l3} K^k_{l1,l2,l3} x_{i − (p_k−1)/2 + l1, j − (q_k−1)/2 + l2, l3}
          = Σ_{l1,l2,l3} K^k_{l1,l2,l3} x_{i' − (p_k−1)/2 + l1, j' − (q_k−1)/2 + l2, l3} = x̂_{i',j',k},

since x is constant and the reflect/replicate padding only repeats the same constant values, so C(x) is constant. For a down-sampling layer, D(x)_{i,j,k} = max{x_{i',j',k} | (i', j') ∈ R} for max pooling and D(x)_{i,j,k} = (1/|R|) Σ_{(i',j') ∈ R} x_{i',j',k} for average pooling, where R denotes the support of the pooling kernel; since x is constant, D(x) is clearly constant. Finally, for an up-sampling layer with bilinear/nearest-point interpolation, U(x)_{i,j,k} is a convex combination of {x_{i',j',k} | (i', j') ∈ S}, where S denotes the corresponding up-sampling region; if x is constant, every such convex combination equals the same constant, so U(x) is constant. Thus Proposition 2 holds.

The above proposition shows that a decoder of this form maps any constant latent image to a constant output, so it cannot reproduce the non-constant noisy image y from a constant latent image; hence the trivial minimum is not attained and the regularization takes effect. Besides, the minimization problem is non-convex and depends on the initialization. In our experiments, we initialize x to the input noisy image y, and the algorithm empirically converges to a reasonably good denoised image.

B EXPERIMENTS ON SYNTHETIC NOISE

Additive white Gaussian noise. Although our method is designed for real-world image denoising, we also evaluate its performance on synthetic noise. We test two Gaussian denoising methods, NLM and BM3D, on two datasets: Set9 (Ulyanov et al., 2018) with 9 classic color images, and BSD68 (Krull et al., 2019a) with 68 gray-scale images. We test four noise levels, σ = 25, 50, 75, 100, and all noisy images are quantized to 8 bits to simulate common JPG images. The results are given in Table 3. We find that even though the noise distribution is Gaussian, both denoising methods can still benefit from our approach to some extent.

Poisson noise. For a noise distribution that differs significantly from Gaussian, we test synthetic Poisson noise on the CBSD68 dataset with 68 color images by combining our method with the VST transformation. The results are given in Table 4; all noisy images are quantized to 8 bits. We find that even though VST is a reliable transformation for the pure Poisson noise case, the neural networks still help improve the VST method. We also find DnCNN to be sensitive to Poisson noise: for a peak value of 5, the result of VST+DnCNN is 17.23/0.3767, and our method (NN+VST+DnCNN) improves upon it by 0.60/0.0629 on average.
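The variance-stabilizing idea behind VST can be sketched with the classical Anscombe transform f(x) = 2√(x + 3/8), shown here as a representative instance (the paper's VST reference, Makitalo & Foi (2012), refines this with exact unbiased inversion): it maps Poisson counts to values with approximately unit variance regardless of the underlying intensity.

```python
import numpy as np

def anscombe(x):
    # Classical Anscombe variance-stabilizing transform for Poisson counts.
    return 2.0 * np.sqrt(x + 3.0 / 8.0)

rng = np.random.default_rng(0)
results = {}
for lam in (10.0, 50.0, 200.0):          # very different Poisson rates
    counts = rng.poisson(lam, size=200_000)
    results[lam] = (counts.var(), anscombe(counts).var())

for lam, (raw_var, stab_var) in results.items():
    # Raw variance tracks the rate; stabilized variance stays near 1.
    print(lam, round(raw_var, 1), round(stab_var, 2))
```

After stabilization the noise is approximately Gaussian with known variance, which is why a Gaussian denoiser can be applied in the transformed domain.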

C EXPERIMENTS ON REAL WORLD NOISE

In this section, we compare the performance of NN+BM3D with other denoising methods on four real-world noisy image datasets: CC (Nam et al., 2016), PolyU (Xu et al., 2018a), DND (Plotz & Roth, 2017) and SIDD (Abdelhamed et al., 2018). We make a comprehensive comparison of our method with many existing representative image denoising methods, including Weighted Nuclear Norm Minimization (WNNM) (Gu et al., 2014), Nonlocally Centralized Sparse Representation (NCSR) (Dong et al., 2012a), Trainable Nonlinear Reaction Diffusion (TNRD) (Chen et al., 2015b), DnCNN (Zhang et al., 2017), Multi-Channel Weighted Nuclear Norm Minimization (MCWNNM) (Xu et al., 2017), Trilateral Weighted Sparse Coding (TWSC) (Xu et al., 2018b), the "Noise Clinic" (NC) method (Lebrun et al., 2015a) and the commercial software Neat Image (NI) (ABSoft, 2017). See Table 5 for the averaged PSNR/SSIM results. The results of the comparison methods on the CC and PolyU datasets are from Xu et al. (2018a), and their results on the DND and SIDD datasets are from the official project websites. In addition, we compare the performance of NN+BM3D with three top blind image denoising networks, DnCNN+ (Zhang et al., 2017), FFDNet+ (Zhang et al., 2018) and CBDNet (Guo et al., 2019), according to the results on the DND and SIDD benchmarks; the denoising results are listed in Table 6. The symbol "-" is used when the result of the corresponding method is not reported. We also compare the performance of NN+BM3D with supervised/unsupervised methods on the FMDD dataset (Zhang et al., 2019): three unsupervised methods, VST+BM3D (Makitalo & Foi, 2012), PURE-LET (Luisier et al., 2010) and TWSC (Xu et al., 2018b), and two supervised methods, DnCNN (Zhang et al., 2017) and N2N (Lehtinen et al., 2018), which are retrained on the FMDD training set. The denoising results are listed in Table 7.

D NN+FFDNET
We replace DnCNN with FFDNet; the results on CC are given in Table 9. FFDNet benefits even more significantly from our method, and NN+FFDNet outperforms its multi-scale version FFDNet+.

E VISUAL RESULTS ON REAL WORLD NOISE

We show visual results on real-world noisy images in E.1 and on real fluorescence microscopy images in E.2. Moreover, ten real noisy images were captured by consumer cameras with ISO 3200 or 320; similar to the CC dataset, we crop a 512 × 512 region from each image to evaluate the performance of NN+BM3D, see E.3.

E.1 VISUAL EXAMPLE OF CC, POLYU

We show visual results for ten noisy images from the CC (Nam et al., 2016) and PolyU (Xu et al., 2018a) datasets. BM3D (Dabov et al., 2007a), DnCNN (Zhang et al., 2017), NC (Lebrun et al., 2015b), MCWNNM (Xu et al., 2017) and TWSC (Xu et al., 2018b) are evaluated for comparison. See the CC results on pages 16 and 17 and the PolyU results on pages 18 and 19.

E.2 VISUAL EXAMPLE OF FMDD

Two images from the FMDD (Zhang et al., 2019) dataset are evaluated for visual comparison. We compare our approach with the VST method (Makitalo & Foi, 2012); see page 20.

E.3 VISUAL EXAMPLES OF REAL IMAGE

Ten real noisy images are evaluated for visual comparison. The results of three traditional methods (BM3D (Dabov et al., 2007a), MCWNNM (Xu et al., 2017) and NC (Lebrun et al., 2015b)) and three deep learning methods (VDN (Yue et al., 2019), DnCNN (Zhang et al., 2017) and FFDNet (Zhang et al., 2018)) are shown. See Figure 10 on pages 21, 22 and 23.



The results of DnCNN+, FFDNet+ and CBDNet are from (Hou et al., 2019), and the results of N2N and NAC are from (Xu et al., 2019).



Figure 1: Two real noisy images. (a) Clean images. (b) Noisy images. (c) Noise distributions in the red, green and yellow patches. (d) BM3D results (PSNR: 26.55 (top) and 29.41 (bottom)). (e) NN+BM3D results (PSNR: 27.53 (top) and 30.05 (bottom)).

Figure 2: The workflow of our method. Middle arrows indicate the alternating optimization between network training and Gaussian denoising in latent space.


Figure 3: Visual results and PSNR/SSIM of an image from CC.

Figure 4: (a) Noisy images with three selected patches. (b) Noise distribution in image space. (c) Noise distribution after the VST transformation. (d-e) Noise distribution in latent space for NN+NLM and NN+BM3D respectively.

Figure 5: (a) Training losses of our decoder using constant input for five images from PolyU. (b) PSNR value versus the number of iterations in NN+BM3D. (c), (d) and (e) Results for different hyperparameters ρ, σ and η respectively.

Figure 6: Evaluation of the latent representation in Algorithm 1 with NN+BM3D.

Figure 9: Visual results and PSNR/SSIM of two noisy images from FMDD.

Figure 10: Visual results for real-world image denoising.

Averaged PSNR and SSIM on CC, PolyU and FMDD.

Comparison with supervised/unsupervised methods on CC.

Average PSNR/SSIM of denoised results on Set9 and BSD 68.

Average PSNR/SSIM of Poisson denoising results on CBSD 68.

Averaged PSNR/SSIM on CC, PolyU, DND, SIDD.

Comparison with supervised methods.

Comparison with supervised/unsupervised methods on FMDD.

Averaged PSNR/SSIM of different architectures of encoder/decoder on CC dataset.

Figure 7: Visual results and PSNR/SSIM of five noisy images from CC.

ACKNOWLEDGEMENTS

This work was supported by the National Natural Science Foundation of China (11901338, 12071244), the Technology and Innovation Major Project of the Ministry of Science and Technology of China under Grants 2020AAA0108400 and 2020AAA0108403, and the Tsinghua University Initiative Scientific Research Program.

