EVALUATING UNSUPERVISED DENOISING REQUIRES UNSUPERVISED METRICS

Abstract

Unsupervised denoising is a crucial challenge in real-world imaging applications. Unsupervised deep-learning methods have demonstrated impressive performance on benchmarks based on synthetic noise. However, no metrics are available to evaluate these methods in an unsupervised fashion. This is highly problematic for the many practical applications where ground-truth clean images are not available. In this work, we propose two novel metrics: the unsupervised mean squared error (uMSE) and the unsupervised peak signal-to-noise ratio (uPSNR), which are computed using only noisy data. We provide a theoretical analysis of these metrics, showing that they are asymptotically consistent estimators of the supervised MSE and PSNR. Controlled numerical experiments with synthetic noise confirm that they provide accurate approximations in practice. We validate our approach on real-world data from two imaging modalities: videos in raw format and transmission electron microscopy. Our results demonstrate that the proposed metrics enable unsupervised evaluation of denoising methods based exclusively on noisy data.

1. INTRODUCTION

Image denoising is a fundamental challenge in image and signal processing, as well as a key preprocessing step for computer vision tasks. Convolutional neural networks achieve state-of-the-art performance on this problem when trained on databases of clean images corrupted with simulated noise Zhang et al. (2017a). However, in real-world imaging applications such as microscopy, noiseless ground-truth videos are often not available. This has motivated the development of unsupervised denoising approaches that can be trained using only noisy measurements Lehtinen et al. (2018); Xie et al. (2020); Laine et al. (2019); Sheth et al. (2021); Huang et al. (2021). These methods have demonstrated impressive performance on natural-image benchmarks, essentially on par with the supervised state of the art. However, to the best of our knowledge, no unsupervised metrics are currently available to evaluate them using only noisy data. Reliance on supervised metrics makes it very challenging to create benchmark datasets from real-world measurements, because obtaining the ground-truth clean images required by these metrics is often either impossible or very constraining. In practice, clean images are typically estimated through temporal averaging, which suppresses dynamic information that is often crucial in scientific applications. Consequently, quantitative evaluation of unsupervised denoising methods is currently dominated almost completely by natural-image benchmark datasets with simulated noise Lehtinen et al. (2018); Xie et al. (2020); Laine et al. (2019); Sheth et al. (2021); Huang et al. (2021), which are not always representative of the signal and noise characteristics that arise in real-world imaging applications. The lack of unsupervised metrics also limits the application of unsupervised denoising techniques in practice. In the absence of quantitative metrics, domain scientists must often rely on visual inspection to evaluate performance on real measurements. This is particularly restrictive for deep-learning approaches, because it makes it impossible to perform systematic hyperparameter optimization and model selection on the data of interest.

In this work, we propose two novel unsupervised metrics to address these issues: the unsupervised mean squared error (uMSE) and the unsupervised peak signal-to-noise ratio (uPSNR), which are computed exclusively from noisy data. These metrics build upon existing unsupervised denoising methods, which minimize an unsupervised cost function equal to the difference between the denoised estimate and additional noisy copies of the signal of interest Lehtinen et al. (2018). The uMSE is equal to this cost function modified with a correction term, which renders it an unbiased estimator of the supervised MSE. We provide a theoretical analysis of the uMSE and uPSNR, proving that they are asymptotically consistent estimators of the supervised MSE and PSNR, respectively. Controlled experiments on supervised benchmarks, where the true MSE and PSNR can be computed exactly, confirm that the uMSE and uPSNR provide accurate approximations. In addition, we validate the metrics on video data in RAW format, contaminated with real noise that does not follow a known predefined model. To illustrate the potential impact of the proposed metrics on imaging applications where no ground truth is available, we apply them to transmission-electron-microscopy (TEM) data. Recent advances in direct electron detection systems make it possible for experimentalists to acquire highly time-resolved movies of dynamic events at frame rates in the kilohertz range Faruqi & McMullan (2018); Ercius et al. (2020), which is critical to advance our understanding of functional materials. Acquisition at such high temporal resolution results in severe degradation by shot noise. We show that unsupervised methods based on deep learning can be effective in removing this noise, and that our proposed metrics can be used to evaluate their performance quantitatively using only noisy data.

To summarize, our contributions are (1) two novel unsupervised metrics presented in Section 3, (2) a theoretical analysis providing an asymptotic characterization of their statistical properties (Section 4), (3) experiments showing the accuracy of the metrics in a controlled situation where ground-truth clean images are available (Section 5), (4) validation on real-world videos in RAW format (Section 6), and (5) an application to a real-world electron-microscopy dataset, which illustrates the challenges of unsupervised denoising in scientific imaging (Section 7).

Unsupervised denoising

The past few years have seen ground-breaking progress in unsupervised denoising, pioneered by Noise2Noise, a technique where a neural network is trained on pairs of noisy images Lehtinen et al. (2018). Our unsupervised metrics are inspired by Noise2Noise, which optimizes a cost function equal to our proposed unsupervised MSE, but without a correction term (the correction is not needed for training models). Subsequent work focused on performing unsupervised denoising from single images using variations of the blind-spot method, where a model is trained to estimate each noisy pixel value from its neighborhood, but not from the noisy pixel itself (to avoid the trivial identity solution) Krull et al. (2019); Laine et al. (2019); Batson & Royer (2019); Sheth et al. (2021); Xie et al. (2020). More recently, Neighbor2Neighbor revisited the Noise2Noise method, generating noisy image pairs from a single noisy image via spatial subsampling Huang et al. (2021), an insight that can also be leveraged in combination with our proposed metrics, as explained in Section C. Our contribution with respect to these methods is a novel unsupervised metric that can be used for evaluation, as it is designed to be an unbiased and consistent estimator of the MSE.

Stein's unbiased risk estimator (SURE) provides an asymptotically unbiased estimator of the MSE for i.i.d. Gaussian noise Donoho & Johnstone (1995). This cost function has been used for training unsupervised denoisers Metzler et al. (2018); Soltanayev & Chun (2018); Zhussip et al. (2019); Mohan et al. (2021). In principle, SURE could be used to compute the MSE for evaluation, but it has certain limitations: (1) a closed-form expression of the noise likelihood is required, including the values of the noise parameters (for example, these are not known for the real-world datasets in Sections 6 and 7); (2) computing SURE requires approximating the divergence of the denoiser, usually via Monte Carlo methods Ramani et al. (2008), which is computationally very expensive. Developing practical unsupervised metrics based on SURE and studying their theoretical properties is an interesting direction for future research.

In the literature, quantitative evaluation of unsupervised denoising techniques has mostly relied on images and videos corrupted with synthetic noise Lehtinen et al. (2018); Krull et al. (2019); Laine et al. (2019); Batson & Royer (2019); Sheth et al. (2021); Xie et al. (2020). Recently, a few datasets containing real noisy data have been created Abdelhamed et al. (2018); Plotz & Roth (2017); Xu et al. (2018); Zhang et al. (2019). Evaluation on these datasets is based on supervised MSE and PSNR computed from estimated clean images obtained by averaging multiple noisy acquisitions.
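To make the construction of the proposed metrics concrete, the following is a minimal numerical sketch of a uMSE/uPSNR of the form described above: a Noise2Noise-style loss against one noisy copy, plus a correction term built from additional noisy copies. The three-copy setup, function names, and the exact form of the correction are illustrative assumptions made here for exposition, not necessarily the precise estimator of Section 3.

```python
import numpy as np

def umse(denoised, a, b, c):
    """Sketch of an unsupervised MSE. For zero-mean noise independent of
    the denoised estimate, E[(a - denoised)^2] = MSE + sigma^2, while
    E[(b - c)^2] / 2 = sigma^2, so subtracting the second term from the
    first yields an unbiased estimate of the supervised MSE."""
    return np.mean((a - denoised) ** 2) - 0.5 * np.mean((b - c) ** 2)

def upsnr(denoised, a, b, c, max_val=255.0):
    """Unsupervised PSNR: plug the uMSE into the usual PSNR formula."""
    return 10.0 * np.log10(max_val ** 2 / umse(denoised, a, b, c))

# Controlled check: with a known ground truth, the uMSE should track
# the supervised MSE without ever touching the clean image.
rng = np.random.default_rng(0)
clean = rng.uniform(50, 200, size=(256, 256))
noisy = [clean + rng.normal(0, 10, clean.shape) for _ in range(3)]
denoised = clean + rng.normal(0, 2, clean.shape)  # stand-in denoiser output

true_mse = np.mean((denoised - clean) ** 2)
est_mse = umse(denoised, *noisy)
print(true_mse, est_mse)  # the two values should be close
```

Note that the correction term uses noisy copies (`b`, `c`) that are distinct from the reference copy `a`; if the denoiser's input were reused as the reference, the estimate would be biased.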

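The spatial-subsampling idea behind Neighbor2Neighbor can be illustrated with a short sketch: a single noisy image is split into two half-resolution images by picking two different pixels from every 2x2 cell, yielding a Noise2Noise-style training pair. This is a simplification for illustration (the published method also adds a regularization term during training, and the helper name here is hypothetical).

```python
import numpy as np

def neighbor_subsample(noisy, rng):
    """Split one noisy image into two half-resolution images by choosing
    two *distinct* pixels at random from every 2x2 cell. If the noise is
    independent across pixels, the two sub-images are independently noisy
    observations of (almost) the same underlying signal."""
    h2, w2 = noisy.shape[0] // 2, noisy.shape[1] // 2
    # Gather the 4 pixels of each 2x2 cell into the last axis.
    cells = noisy[: 2 * h2, : 2 * w2].reshape(h2, 2, w2, 2)
    cells = cells.transpose(0, 2, 1, 3).reshape(h2, w2, 4)
    # Two distinct indices in {0, 1, 2, 3} per cell.
    first = rng.integers(0, 4, size=(h2, w2))
    second = (first + rng.integers(1, 4, size=(h2, w2))) % 4
    rows, cols = np.indices((h2, w2))
    return cells[rows, cols, first], cells[rows, cols, second]

rng = np.random.default_rng(1)
noisy = rng.normal(100, 10, size=(8, 10))
sub1, sub2 = neighbor_subsample(noisy, rng)
print(sub1.shape, sub2.shape)  # (4, 5) (4, 5)
```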

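The Monte Carlo approximation of the denoiser's divergence required by SURE (Ramani et al., 2008) can be sketched as follows. The shrinkage "denoiser" is a toy stand-in chosen because its exact divergence is known, and the noise level sigma is assumed given in closed form, which is precisely the limitation discussed above.

```python
import numpy as np

def mc_divergence(denoiser, y, eps=1e-3, rng=None):
    """Monte Carlo estimate of div f(y) = sum_i df_i/dy_i: probe the
    denoiser with a random direction b and average b . (f(y + eps*b)
    - f(y)) / eps. Each estimate costs one extra denoiser evaluation."""
    rng = rng or np.random.default_rng()
    b = rng.standard_normal(y.shape)
    return np.sum(b * (denoiser(y + eps * b) - denoiser(y))) / eps

def sure(denoiser, y, sigma):
    """Per-pixel SURE for i.i.d. Gaussian noise with known std sigma:
    mean residual - sigma^2 + 2 sigma^2 * div f(y) / N."""
    n = y.size
    residual = np.mean((y - denoiser(y)) ** 2)
    return residual - sigma ** 2 + 2 * sigma ** 2 * mc_divergence(denoiser, y) / n

# Toy linear denoiser f(y) = 0.8 y: its exact divergence is 0.8 * N,
# so the Monte Carlo estimate can be sanity-checked directly.
shrink = lambda y: 0.8 * y
rng = np.random.default_rng(2)
y = rng.normal(0, 1, size=10_000)
div_est = mc_divergence(shrink, y, rng=rng)
print(div_est / y.size)  # should be close to 0.8
```

For this toy case (clean signal zero, sigma = 1), the true risk of the shrinkage estimator is about 0.64, and `sure(shrink, y, 1.0)` recovers it without access to the clean signal.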