NERDS: A GENERAL FRAMEWORK TO TRAIN CAMERA DENOISERS FROM RAW-RGB NOISY IMAGE PAIRS

Abstract

We aim to train accurate denoising networks for smartphone/digital cameras from raw-RGB noisy image pairs. Downscaling is commonly used as a practical denoiser for low-resolution images. Motivated by this practice, we find that the pixel variance of natural images is more robust to downscaling than the pixel variance of camera noise. Intuitively, downscaling removes high-frequency noise more easily than natural textures. Exploiting this property, we can synthesize noisy/clean image pairs at low resolution to train camera denoisers. On this basis, we propose a new solution pipeline, NERDS, that estimates camera noise and synthesizes noisy-clean image pairs from only noisy images. In particular, it first models the noise in raw-sensor images as a Poisson-Gaussian distribution, then estimates the noise parameters from the difference of pixel variances before and after downscaling. We formulate the noise estimation as a gradient-descent-based optimization problem through a reparametrization trick. We further introduce a new Image Signal Processor (ISP) estimation method that enables denoiser training in a human-readable RGB space by transforming the downscaled raw images to the style of a given RGB noisy image. The noise and ISP estimates drive rich augmentation to synthesize image pairs for denoiser training. Experiments show that NERDS can accurately train CNN-based denoisers (e.g., DnCNN, ResNet-style networks), outperforming previous noise-synthesis-based and self-supervision-based denoisers on real datasets.
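The key observation above can be checked numerically. The sketch below is illustrative only and is not the paper's implementation: it builds a synthetic smooth "natural" signal, adds signal-dependent noise (approximating the Poisson component by a Gaussian whose variance depends on the clean intensity, a common approximation), and compares how much box-filter downscaling reduces the variance of each. The image, noise parameters `a` and `b`, and the 2x downscaling factor are all assumptions chosen for the demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)

def downscale(img, factor=2):
    """Box-filter downscaling: average non-overlapping factor x factor blocks."""
    h, w = img.shape
    return (img[:h // factor * factor, :w // factor * factor]
            .reshape(h // factor, factor, w // factor, factor)
            .mean(axis=(1, 3)))

# A smooth synthetic "natural" signal: low-frequency intensity variation.
x, y = np.meshgrid(np.linspace(0, 1, 256), np.linspace(0, 1, 256))
clean = 0.4 + 0.3 * np.sin(4 * x) * np.cos(3 * y)

# Heteroscedastic (Poisson-Gaussian-style) noise: variance = a * clean + b.
a, b = 0.01, 0.0004
noise = rng.normal(0.0, np.sqrt(a * clean + b))

for name, im in [("clean signal", clean), ("noise", noise)]:
    ratio = downscale(im).var() / im.var()
    print(f"{name}: variance ratio after 2x downscaling = {ratio:.3f}")
```

Because the noise is pixel-wise independent, averaging 2x2 blocks cuts its variance by roughly a factor of four, while the smooth signal's variance is nearly unchanged; this gap is what makes the downscaled pixel variance informative about the noise parameters.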

1. INTRODUCTION

Image denoising is a classic machine learning problem that restores the original colors and patterns from noisy images. Deep-learning-based approaches have achieved breakthroughs in recent years thanks to the power of neural networks. Early works (55; 38; 45) successfully removed additive white Gaussian noise (AWGN), which allows supervised network training by synthesizing noisy-clean image pairs. Nevertheless, denoising images captured by smartphone/digital cameras poses an obstacle, as it is difficult to obtain clean counterparts for noisy images with pixel-level alignment. Several works (2; 7) constructed datasets of noisy-clean pairs for real-world images. Using these pairs (Figure 1(a)), many supervised-learning-based denoisers (51; 28; 52; 24; 15) restore crisp images on benchmarks from these datasets. However, constructing such datasets requires tightly controlled capturing environments, complicated post-processing, and massive human labor.

To overcome the drawbacks of plain supervised learning, two major lines of research have been studied. The first generates realistic noisy images from clean images to enable supervised denoiser training, as visualized in Figure 1(b). Several approaches (14; 11; 23; 26) adopt generative models trained on unpaired noisy-clean images based on GANs (20), but they achieve limited accuracy on real noise. Other works synthesize realistic noise using existing noisy-clean image pairs (53; 1) or metadata for real cameras (6; 22), but they generalize poorly to unseen noise.

The second category aims to learn denoisers without clean images. The first work (34) in this category proposed a learning framework using multiple noisy images. After that, many self-supervised-learning approaches (5; 31; 10) use single noisy images (Figure 1(c)), which enable easy

