FULLY UNSUPERVISED DIVERSITY DENOISING WITH CONVOLUTIONAL VARIATIONAL AUTOENCODERS

Abstract

Deep Learning based methods have emerged as the indisputable leaders for virtually all image restoration tasks. Especially in the domain of microscopy images, various content-aware image restoration (CARE) approaches are now used to improve the interpretability of acquired data. Naturally, there are limitations to what can be restored in corrupted images, and, as with all inverse problems, many potential solutions exist, of which one must be chosen. Here, we propose DIVNOISING, a denoising approach based on fully convolutional variational autoencoders (VAEs), which overcomes the problem of having to choose a single solution by predicting a whole distribution of denoised images. First, we introduce a principled way of formulating the unsupervised denoising problem within the VAE framework by explicitly incorporating imaging noise models into the decoder. Our approach is fully unsupervised, only requiring noisy images and a suitable description of the imaging noise distribution. We show that such a noise model can either be measured, bootstrapped from noisy data, or co-learned during training. If desired, consensus predictions can be inferred from a set of DIVNOISING predictions, leading to results that are competitive with other unsupervised methods and, on occasion, even with the supervised state-of-the-art. DIVNOISING samples from the posterior enable a plethora of useful applications. We (i) show denoising results for 13 datasets, (ii) discuss how optical character recognition (OCR) applications can benefit from diverse predictions, and (iii) demonstrate how instance cell segmentation improves when using diverse DIVNOISING predictions.

1. INTRODUCTION

The goal of scientific image analysis is to analyze pixel data and measure the properties of objects of interest in images. Pixel intensities are subject to undesired noise and other distortions, motivating an initial preprocessing step called image restoration. Image restoration is the task of removing unwanted noise and distortions, giving us clean images that are closer to the true but unknown signal. In the past years, Deep Learning (DL) has enabled tremendous progress in image restoration (Mao et al., 2016; Zhang et al., 2017b; Zhang et al., 2017; Weigert et al., 2018). Supervised DL methods use corresponding pairs of clean and distorted images to learn a mapping between the two quality levels. The utility of this approach is especially pronounced for microscopy image data of biological samples (Weigert et al., 2017; 2018; Ouyang et al., 2018; Wang et al., 2019), where quantitative downstream analysis is essential. More recently, unsupervised content-aware image restoration (CARE) methods (Lehtinen et al., 2018; Krull et al., 2019; Batson & Royer, 2019; Buchholz et al., 2019) have emerged. Enabled by sensible assumptions about the statistics of imaging noise, they can learn a mapping from noisy to clean images without ever seeing clean data during training. Some of these methods additionally include a probabilistic model of the imaging noise (Krull et al., 2020; Laine et al., 2019; Prakash et al., 2020; Khademi et al., 2020) to further improve their performance. Note that such denoisers can directly be trained on a given body of noisy images.
[Footnote: * Shared first authors. † Shared last authors.]
[Figure caption fragment: (bottom) After training, the encoder can be used to sample multiple $z^k \sim q_\phi(z|x)$, giving rise to diverse denoised samples $s^k$. These samples can further be used to infer consensus point estimates such as an MMSE or a MAP solution.]
All existing approaches have a common flaw: distortions degrade some of the information content in images, generally making it impossible to fully recover the desired clean signal with certainty. Even an ideal method cannot know which of many possible clean images really has given rise to the degraded observation at hand. Hence, any restoration method has to make a compromise between possible solutions when predicting a restored image. Generative models, such as VAEs, are a canonical choice when a distribution over a set of variables needs to be learned. Still, so far VAEs have been overlooked as a method to solve unsupervised image denoising problems. This might also be due to the fact that vanilla VAEs (Kingma & Welling, 2014; Rezende et al., 2014) show sub-par performance on denoising problems (see Section 6). Here we introduce DIVNOISING, a principled approach to incorporate explicit models of the imaging noise distribution in the decoder of a VAE. Such noise models can be either measured or derived (bootstrapped) from the noisy image data alone (Krull et al., 2020; Prakash et al., 2020). Additionally, we propose a way to co-learn a suitable noise model during training, rendering DIVNOISING fully unsupervised. We show on 13 datasets that fully convolutional VAEs, trained with our proposed DIVNOISING framework, yield competitive results, in 8 cases actually becoming the new state-of-the-art (see Fig. 2 and Table 1). Still, the key benefit of DIVNOISING is that the method does not need to commit to a single prediction, but is instead capable of generating diverse samples from an approximate posterior of possible true signals. (Note that point estimates can still be inferred if desired, as shown in Fig. 4.)
Other unsupervised denoising methods only provide a single solution (point estimate) of that posterior (Krull et al., 2019; Lehtinen et al., 2018; Batson & Royer, 2019) or predict an independent posterior distribution of intensities per pixel (Krull et al., 2020; Laine et al., 2019; Prakash et al., 2020; Khademi et al., 2020). Hence, DIVNOISING is the first method that learns to approximate the posterior over meaningful structures in a given body of images. We believe that DIVNOISING will be hugely beneficial for computational biology applications in biomedical imaging, where noise is typically unavoidable and huge datasets need to be processed on a daily basis. Here, DIVNOISING enables unsupervised diverse SOTA denoising while requiring only comparatively little computational resources, rendering our approach particularly practical. Finally, we discuss the utility of diverse denoising results for OCR and showcase it for a ubiquitous analysis task in biology: the instance segmentation of cells in microscopy images (see Fig. 5). Hence, DIVNOISING has the potential to be useful for many real-world applications and will not only generate state-of-the-art (SOTA) restored images, but also enrich quantitative downstream processing.

2. RELATED WORK

Classical Denoising. The denoising problem has been addressed by a variety of filtering approaches. Arguably some of the most prominent ones are Non-Local Means (Buades et al., 2005) and BM3D (Dabov et al., 2007), which implement a sophisticated non-local filtering scheme. A comprehensive survey and in-depth discussion of such methods can be found in (Milanfar, 2012).
DL Based Denoising. Deep Learning methods which directly learn a mapping from a noisy image to its clean counterpart (see e.g. (Zhang et al., 2017a) and (Weigert et al., 2018)) have outperformed classical denoising methods in recent years. Two well known contributions are the seminal works by Zhang et al. (2017a) and later by Weigert et al. (2018). More recently, a number of unsupervised variations have been proposed, and in Section 1 we have described their advantages and disadvantages in detail. One additional interesting contribution was made by Ulyanov et al. (2018), introducing a quite different kind of unsupervised restoration approach. Their method, Deep Image Prior, trains a network separately for each noisy input image in the training set, making this approach computationally rather expensive. Furthermore, training has to be stopped after a suitable but a priori unknown number of training steps. Recently, Quan et al. (2020) proposed an interesting method called SELF2SELF which trains a U-NET like architecture requiring only single noisy images. The key idea of this approach is to use blind spot masking, similar to Krull et al. (2019), together with dropout (Srivastava et al., 2014), which avoids overfitting and allows sampling of diverse solutions. Similar to DIVNOISING, the single denoised result is obtained by averaging many diverse predictions. Diverse results obtained via dropout are generally considered to capture the so-called epistemic or model uncertainty (Gal & Ghahramani, 2016; Lakshminarayanan et al., 2017), i.e. the uncertainty arising from the fact that we have a limited amount of training data available. In contrast, DIVNOISING combines a VAE and a model of the imaging noise to capture what is known as aleatoric or data uncertainty (Böhm et al., 2019; Sensoy et al., 2020), i.e. the unavoidable uncertainty about the true signal resulting from noisy measurements. Like Deep Image Prior (Ulyanov et al., 2018), SELF2SELF trains separately on each image that has to be denoised. While this renders the method universally applicable, it is computationally prohibitive when applied to large datasets. The same is true for real time applications such as facial denoising. DIVNOISING, on the other hand, is trained only once on a given body of data. Afterwards, it can be efficiently applied to new images. A detailed comparison of SELF2SELF and DIVNOISING in terms of denoising performance, run time and GPU memory requirements can be found in Appendix A.14 and Appendix Table 2.
Denoising (Variational) Autoencoders. Despite the suggestive name, denoising variational autoencoders (Im et al., 2017) do not solve denoising problems. Instead, this method proposes to add noise to the input data in order to boost the quality of encoder distributions. This, in turn, can lead to stronger generative models. Other methods follow a similar approach to improve the overall performance of autoencoders (Vincent et al., 2008; 2010; Jiao et al., 2020).
VAEs for Diverse Solution Sampling. Although not explored in the context of unsupervised denoising, VAEs are designed to sample diverse solutions from trained posteriors. The probabilistic U-NET (Kohl et al., 2018; 2019) uses conditional VAEs to learn a conditional distribution over segmentations. Baumgartner et al. (2019) improve the diversity of segmentation samples by introducing a hierarchy of latent variables to model segmentations at multiple resolutions. Unlike DIVNOISING, both methods rely on paired training data. Nazabal et al. (2020) employ VAEs to learn the distribution of incomplete and heterogeneous data in a fully unsupervised manner. Babaeizadeh et al. (2017) build upon a VAE style framework to predict multiple plausible future frames of videos conditioned on given context frames. A variational inference approach was used by Balakrishnan et al. (2019) to generate multiple deprojected samples for images and videos collapsed in either spatial or temporal dimensions. Unlike all these approaches, we address the uncertainty introduced by common imaging noise and show how denoised samples can improve downstream processing.

3. THE DENOISING TASK

Image restoration is the task of estimating a clean signal $s = (s_1, \dots, s_N)$ from a corrupted observation $x = (x_1, \dots, x_N)$, where $s_i$ and $x_i$ refer to the respective pixel intensities. The corrupted $x$ is thought to be drawn from a probability distribution $p_{\mathrm{NM}}(x|s)$, which we call the observation likelihood or the noise model. In this work we focus on restoring images that suffer from insufficient illumination and detector/camera imperfections. Contrary to existing methods, DIVNOISING is designed to capture the inherent uncertainty of the denoising problem by learning a suitable posterior distribution. Formally, the posterior we are interested in is $p(s|x) \propto p(x|s)\,p(s)$ and depends on two components: the prior distribution $p(s)$ of the signal as well as the observation likelihood $p_{\mathrm{NM}}(x|s)$ we introduced above. While the prior is a highly complex distribution, the likelihood $p(x|s)$ of a given imaging system (camera/microscope) can be described analytically (Krull et al., 2020).
Models of Imaging Noise. The noise model is usually thought to factorize as a product over pixels, implying that the corruption, given the underlying signal, occurs independently in each pixel:
$$p(x|s) = \prod_{i=1}^{N} p_{\mathrm{NM}}(x_i|s_i). \qquad (1)$$
This assumption is known to hold true for Poisson shot noise and camera readout noise (Zhang et al., 2019; Krull et al., 2020; Prakash et al., 2020). We will refer to the probability $p_{\mathrm{NM}}(x_i|s_i)$ of observing a particular noisy value $x_i$ at a pixel $i$ given clean signal $s_i$ as the pixel noise model. Various types of pixel noise models have been proposed, ranging from physics based analytical models (Zhang et al., 2019; Luisier et al., 2010; Foi et al., 2008) to simple histograms (Krull et al., 2020). In this work, we follow the Gaussian Mixture Model (GMM) based noise model description of (Prakash et al., 2020).
The parameters of a noise model can be estimated whenever pairs $(x', s')$ of corresponding noisy and clean calibration images are available (Krull et al., 2020). The signal $s' \approx \frac{1}{M} \sum_{j=1}^{M} x'_j$ can then be computed by averaging $M$ noisy observations $x'_j$ of the same content (Prakash et al., 2020). In a case where no calibration data can be acquired, $s'$ can be estimated by a bootstrapping approach (Prakash et al., 2020). Later, we additionally show how a suitable noise model can be co-learned during training.
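As a hedged illustration of Eq. 1 and of the calibration step described above, the following dependency-free Python sketch averages several noisy observations into a pseudo ground truth and evaluates a factorized log-likelihood. The Gaussian pixel noise model with signal-dependent variance, the parameters `a` and `b`, and all function names are assumptions made for this example only; the paper itself uses GMM-based noise models.

```python
import math

def pseudo_ground_truth(noisy_stack):
    """Average M noisy observations (flat lists of pixel values) pixel-wise."""
    m = len(noisy_stack)
    n = len(noisy_stack[0])
    return [sum(img[i] for img in noisy_stack) / m for i in range(n)]

def pixel_log_likelihood(x_i, s_i, a=1.0, b=1.0):
    """log p_NM(x_i | s_i) for an assumed Gaussian with variance a * s_i + b."""
    var = max(a * s_i + b, 1e-6)  # guard against non-positive variance
    return -0.5 * (math.log(2.0 * math.pi * var) + (x_i - s_i) ** 2 / var)

def log_likelihood(x, s, a=1.0, b=1.0):
    """Eq. 1 in log form: a sum of independent per-pixel terms."""
    return sum(pixel_log_likelihood(x_i, s_i, a, b) for x_i, s_i in zip(x, s))

# Three toy observations of the same two-pixel signal.
stack = [[9.0, 21.0], [11.0, 19.0], [10.0, 20.0]]
s_hat = pseudo_ground_truth(stack)
ll = log_likelihood(stack[0], s_hat)
```

Working in log space turns the product over pixels into a sum, which is the form that reappears in the reconstruction loss.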

4. THE VARIATIONAL AUTOENCODER (VAE)

We briefly review the VAE approach introduced by Kingma & Welling (2014). A more complete introduction to the topic can be found in (Doersch, 2016; Kingma & Welling, 2019). VAEs are generative models, capable of learning complex distributions over images $x$, such as handwritten digits (Kingma & Welling, 2014) or faces (Huang et al., 2018). To achieve this, VAEs use a latent variable $z$ with a fixed prior $p(z)$ (usually a unit normal distribution) and describe
$$p_\theta(x) = \int p_\theta(x|z)\,p(z)\,dz. \qquad (2)$$
Like conventional autoencoders, they consist of two components: a decoder network $g_\theta(z)$ that takes a point in latent space and maps it to a distribution $p_\theta(x|z)$ in image space, and an encoder network $f_\phi(x)$, which takes an observed image and maps it to a distribution $q_\phi(z|x)$ in latent space. By $\phi$ and $\theta$, we denote the network parameters of the encoder and decoder, respectively. Note that the decoder alone (together with a suitable prior $p(z)$) is sufficient to completely describe the generative model in Eq. 2. It is usually modelled to factorize over pixels,
$$p_\theta(x|z) = \prod_{i=1}^{N} p_\theta(x_i|z), \qquad (3)$$
where $p_\theta(x_i|z)$ is a normal distribution, with its mean and variance predicted by the decoder network $g_\theta(z)$. The encoder distribution is modelled in a similar fashion, factorizing over the dimensions of the latent space. Training a VAE consists of adjusting the parameters $\theta$ to make sure that Eq. 2 fits the distribution of training images $x$. Kingma & Welling (2014) show that this can be achieved with the help of the encoder by jointly optimizing $\phi$ and $\theta$ to minimize the loss
$$L_{\phi,\theta}(x) = L^{R}_{\phi,\theta}(x) + L^{KL}_{\phi}(x),$$
where
$$L^{R}_{\phi,\theta}(x) = E_{q_\phi(z|x)}\left[-\log p_\theta(x|z)\right] = E_{q_\phi(z|x)}\left[\sum_{i=1}^{N} -\log p_\theta(x_i|z)\right],$$
and $L^{KL}_{\phi}(x)$ is the KL divergence $KL(q_\phi(z|x)\,\|\,p(z))$.
While $L^{KL}_{\phi}(x)$ can be computed analytically, the expected value in $L^{R}_{\phi,\theta}(x)$ is approximated by drawing a single sample $z^1$ from $q_\phi(z|x)$ and using the reparametrization trick by Kingma & Welling (2014) for gradient computation.
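The loss above can be made concrete with a small, dependency-free sketch. It assumes a diagonal Gaussian encoder distribution and a unit normal prior (for which the KL term has a closed form) and estimates the reconstruction term from one reparametrized sample. The encoder and decoder here are hypothetical toy functions, not the networks used in the paper, and no gradients are computed.

```python
import math
import random

def kl_to_unit_normal(mu, log_var):
    """Analytic KL(q || N(0, I)) for a diagonal Gaussian q."""
    return 0.5 * sum(m * m + math.exp(lv) - lv - 1.0 for m, lv in zip(mu, log_var))

def reparametrize(mu, log_var, rng):
    """Draw z = mu + sigma * eps with eps ~ N(0, 1)."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0) for m, lv in zip(mu, log_var)]

def gaussian_nll(x, mean, var):
    """Reconstruction term: sum over pixels of -log N(x_i; mean_i, var_i)."""
    return sum(0.5 * (math.log(2.0 * math.pi * v) + (x_i - m) ** 2 / v)
               for x_i, m, v in zip(x, mean, var))

def vae_loss(x, encode, decode, rng):
    """Single-sample estimate of the reconstruction loss plus the KL loss."""
    mu, log_var = encode(x)
    z = reparametrize(mu, log_var, rng)
    mean, var = decode(z)
    return gaussian_nll(x, mean, var) + kl_to_unit_normal(mu, log_var)

# Hypothetical toy stand-ins for the encoder and decoder networks (1-D latent).
def toy_encode(x):
    return [sum(x) / len(x)], [0.0]

def toy_decode(z):
    return [z[0]] * 3, [1.0] * 3

loss = vae_loss([0.1, -0.2, 0.3], toy_encode, toy_decode, random.Random(0))
```

In a real implementation the same computation would be expressed in an automatic differentiation framework, so that gradients flow through `reparametrize` into the encoder parameters.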

5. DIVNOISING

In DIVNOISING, we build on the VAE setup but interpret it from a denoising-specific perspective. We assume that images have been created from a clean signal $s$ via a known noise model, i.e., $x \sim p_{\mathrm{NM}}(x|s)$. To account for this within the VAE setup, we replace the generic normal distribution over pixel intensities in Eq. 3 with a known noise model $p_{\mathrm{NM}}(x|s)$ (see Eq. 1). We get
$$p_\theta(x|z) = p_{\mathrm{NM}}(x|s) = \prod_{i=1}^{N} p_{\mathrm{NM}}(x_i|s_i),$$
with the decoder now predicting the signal $g_\theta(z) = s$. Together with $p(z)$ and the noise model, the decoder now describes a full joint model for all three variables, including the signal:
$$p_\theta(z, x, s) = p_{\mathrm{NM}}(x|s)\,p_\theta(s|z)\,p(z),$$
where we assume that $p_{\mathrm{NM}}(x|s, z) = p_{\mathrm{NM}}(x|s)$. For a given $z^k$, as for standard VAEs, the decoder describes a distribution over noisy images $p(x|z)$. The corresponding clean signal $s^k$, in contrast, is deterministically defined. Hence, $p_\theta(s|z)$ is a Dirac distribution centered at $g_\theta(z)$.
Training. Considering Eq. 1, the reconstruction loss becomes
$$L^{R}_{\phi,\theta}(x) = E_{q_\phi(z|x)}\left[\sum_{i=1}^{N} -\log p(x_i \mid s = g_\theta(z))\right].$$
Apart from this modification, we can follow the standard VAE training procedure, just as described in Section 4. Since we have only modified how the decoder distribution is modeled, we can assume that the training procedure still produces (i) a model describing the distribution of our training data, while (ii) making sure that the encoder distribution well approximates the distribution of the latent variable given the image. A complete derivation of the DIVNOISING loss (from a probability model perspective) can be found in Appendix A.12.
[Table 1 caption: For all experiments, we compare all results in terms of mean Peak Signal-to-Noise Ratio (PSNR in dB) and ±1 standard error over 5 runs. Overall best performance is underlined, the best unsupervised method is in bold, and the best fully unsupervised method is in italic. For many datasets, DIVNOISING is the unsupervised SOTA, typically not being far behind the supervised CARE results.]
Prediction. While we can use the trained VAE to generate images from $p_\theta(x)$ (see Appendix A.5), here we are mainly interested in denoising. Hence, we desire access to the posterior $p(s|x)$, i.e. the distribution of possible clean signals $s$ given a noisy observation $x$. Assuming the encoder and decoder are sufficiently well trained, samples $s^k$ from an approximate posterior can be obtained by (i) feeding the noisy image $x$ into our encoder, (ii) drawing samples $z^k \sim q_\phi(z|x)$, and (iii) decoding the samples via the decoder to get $s^k = g_\theta(z^k)$.
Inference. Given a set of posterior samples $s^k$ for a noisy image $x$, we can infer different consensus estimates (point estimates). We can, for example, approximate the MMSE estimate (see Fig. 2) by averaging many samples $s^k$. Alternatively, we can attempt to find the maximum a posteriori (MAP) estimate, i.e. the most likely signal given the noisy observation $x$, by finding the mode of the posterior distribution. For this purpose, we iteratively use the mean shift algorithm (Cheng, 1995) with decreasing bandwidth to find the mode of our sample set (see Fig. 4 and Appendix A.4).
Fully Unsupervised DivNoising. So far we explained our setup under the assumption that the noise model can either be measured with paired calibration images, or bootstrapped from noisy data (Prakash et al., 2020). Here, we propose yet another alternative: co-learning the noise model directly from noisy data during training. More concretely, this is enabled by a simple modification to the DIVNOISING decoder. We assume that the noise at each pixel $i$ follows a normal distribution with its variance being a linear function of $s_i$, i.e., $\sigma_i^2 = a s_i + b$. Linearity is motivated by noise properties in low-light settings (Faraji & MacLean, 2006; Jezierska et al., 2011). The learnable network parameters $a$ and $b$ are co-optimized during training. Since variances cannot be negative, we additionally constrain the predicted values for $\sigma_i^2$ to be positive (see Appendix A.3 for details).
Denoising with Vanilla VAEs. While not originally intended for denoising tasks, we were curious to see how vanilla VAEs perform when applied to these problems. Just like fully unsupervised DIVNOISING, the vanilla VAE does not require a noise model. Instead, it directly predicts per-pixel mean and variance (see Section 4), leaving the possibility open that the right values could be learned. However, here the decoder is not restricted to make each pixel's variance a function of the predicted signal. We investigate the denoising performance of the vanilla VAE in Section 6 and show in Fig. 5 that the predicted variances significantly diverge from ground truth noise distributions.
Signal Prior in DIVNOISING. Classical denoising methods often explicitly model the image/signal prior $p(s)$, e.g. as smoothness priors (Grimson & Grimson, 1981; Li, 1994), non-local similarity priors (Buades et al., 2005; Dabov et al., 2007), sparseness priors (Tibshirani, 1996) etc., assuming specific properties of the images at hand. They effectively assign the same probability to all images/signals sharing e.g. the same level of smoothness. However, the true distribution $p(s)$ of clean signals (e.g. for a particular experimental setup in a fluorescence microscope) is generally more complex. Instead of explicitly modelling $p(s)$, DIVNOISING only implicitly describes $p_\theta(s) = \int p_\theta(s|z)\,p(z)\,dz$ as an integral over all possible values of $z$. We recall that the prior $p(z)$ is assumed to be the unit Gaussian distribution and the conditional distribution $p_\theta(s|z)$ is learned by the decoder network as the Dirac distribution centered at $g_\theta(z)$. Depending on its parameters $\theta$, the network will implement the function differently, leading to a different $p_\theta(s|z)$, and ultimately to a different $p_\theta(s)$. This implicit distribution is quite powerful and can capture complex structures. See Appendix A.5 for samples obtained from this signal prior for different datasets.
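The prediction and inference steps of this section (encode, sample latent codes, decode, then average) can be sketched in a few lines. This is a minimal illustration under stated assumptions: the encoder returns the mean and log-variance of a diagonal Gaussian, and `toy_encode` and `toy_decode` are hypothetical stand-ins for trained networks. The mean shift based MAP estimate is not shown.

```python
import math
import random

def sample_posterior(x, encode, decode, num_samples, rng):
    """Return samples s^k = g(z^k) with z^k drawn from q(z | x)."""
    mu, log_var = encode(x)
    samples = []
    for _ in range(num_samples):
        z = [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
             for m, lv in zip(mu, log_var)]
        samples.append(decode(z))
    return samples

def mmse_estimate(samples):
    """The pixel-wise mean of the samples approximates the MMSE estimate."""
    k = len(samples)
    return [sum(s[i] for s in samples) / k for i in range(len(samples[0]))]

# Hypothetical toy networks: 1-D latent with standard deviation 0.5.
def toy_encode(x):
    return [sum(x) / len(x)], [math.log(0.25)]

def toy_decode(z):
    return [z[0], 2.0 * z[0]]

samples = sample_posterior([1.0, 3.0], toy_encode, toy_decode, 1000, random.Random(0))
mmse = mmse_estimate(samples)
```

The encoder is evaluated once per image and only the cheap sampling and decoding steps are repeated, which is in line with the fast sampling times reported in Section 6.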

6. DATA, EXPERIMENTS, RESULTS

We quantitatively evaluated the performance of DIVNOISING on 13 publicly available datasets (see Appendices A.1 and A.2 for data details), 9 of which are subject to high levels of intrinsic (real world) noise. To the other 4 we added synthetic noise, giving us full knowledge about the nature of the added noise.
Denoising Baselines. We choose state-of-the-art baseline methods to compare against DIVNOISING, namely, the supervised CARE (Weigert et al., 2018) and the unsupervised methods NOISE2VOID (N2V) (Krull et al., 2019) and Probabilistic NOISE2VOID (PN2V) (Krull et al., 2020). All baselines use the available implementations of (Krull et al., 2020) and, unless specified otherwise, make use of a depth 3 U-NET with 1 input channel and 64 channels in the first layer. As an additional baseline, we choose vanilla VAEs with the same network architecture as DIVNOISING, but predicting per pixel mean and variance independently. Baselines are trained with the ADAM optimizer (Kingma & Ba, 2015) for 200 epochs with 10 steps per epoch, an initial learning rate of 0.001, and the same basic learning rate scheduler as in (Krull et al., 2020). N2V and CARE use a batch size of 4 and PN2V uses a batch size of 1, each with a virtual batch size of 20. All baselines use on-the-fly data augmentation (flipping and rotation) during training.
Training Details. In all experiments we use rather small, fully convolutional VAE networks, with either 200k or 713k parameters (see Appendix A.3). For all experiments on intrinsically noisy microscopy data, validation and test set splits follow the ones described in the respective publication. In contrast to the synthetically noisy data, no a priori noise model is known for microscopy datasets. For these datasets, we used GMM-based noise models (Prakash et al., 2020; Khademi et al., 2020), which are measured from calibration images, as well as co-learned noise models. For the W2S datasets, no dedicated calibration samples to create noise models are available. Hence, for these datasets, we use the available clean ground truth images and all noisy observations of the training data to learn a GMM-based noise model. All GMM noise models use 3 Gaussians and 2 coefficients each. More training details can be found in Appendix A.3.
Denoising Results. In Table 1, we report the denoising performance of all experiments we conducted in terms of peak signal-to-noise ratio (PSNR) with respect to available ground truth images. The DIVNOISING results (using the MMSE estimate from 1000 averaged samples) are typically on par with or better than the denoising quality reached by the baselines in the 'fully unsupervised' category, as well as the 'unsupervised with noise model' category. Note that sampling is very efficient. For all presented experiments, sampling 1000 images consistently took less than 7 seconds (see …). DIVNOISING MMSE is typically, as expected, slightly behind the performance of the fully supervised baseline CARE (Weigert et al., 2018). Additionally, on FU-PN2V Convallaria we have demonstrated that a suitable noise model for DIVNOISING can be created via bootstrapping (Prakash et al., 2020; Khademi et al., 2020). We also compare against Deep Image Prior (DIP) on the DenoiSeg Flywing dataset, as it has the smallest number of test images and DIP has to be trained for each image. DIP achieves a PSNR of 24.67 ± 0.050 dB compared to 25.02 ± 0.024 dB with DIVNOISING. Due to the extensive computational requirements of SELF2SELF, we cannot run the method on all images in any of our datasets. Instead, we run it on single, randomly selected images from the FU-PN2V Convallaria, FU-PN2V Mouse actin, FU-PN2V Mouse nuclei, and W2S Ch.1 (avg1) datasets. We compare SELF2SELF to DIVNOISING when (i) trained on the same randomly chosen image from the respective dataset, and (ii) when DIVNOISING was trained on the entire dataset and applied on the respective randomly selected image. Within a generous time limit of 10 hours for training per image, DIVNOISING still outperforms SELF2SELF in measured PSNR performance while requiring about 7 times less GPU memory (see Appendix A.14 and Appendix Table 2). Note that the application of SELF2SELF to an entire dataset containing 100 images would require 1000 hours of cumulative training time (100 images at 10 hours each), while an overall 10 hour training of DIVNOISING on the entire dataset is sufficient to denoise all contained images. The performance on the natural image benchmark dataset BSD68 (Roth & Black, 2005) is shown in Fig. 26 and discussed in Appendix A.10. Additional qualitative results for all datasets can be found in Appendix A.9. A discussion on the accuracy of the posterior modeled by DIVNOISING can be found in Appendix A.11.
Downstream Processing: OCR. In Fig. 4 we show how Optical Character Recognition (OCR) applications might benefit from diverse denoising. While regular denoising approaches predict poor compromises that would never be seen in clean text, DIVNOISING can generate a diverse set of rather plausible denoised solutions. While our MAP estimates clean up most such problems, occasional mistakes cannot be avoided, e.g. changing "hunger" to "hungor" (see Fig. 4). Diverse denoising solutions obtained by clustering typically correspond to plausible alternative interpretations. It stands to reason that OCR systems can benefit from having access to diverse interpretations.
Downstream Processing: Instance Cell Segmentation. We demonstrate how diverse denoised images generated with DIVNOISING can help to segment all cells in the DenoiSeg Flywing data. While methods to generate diverse segmentations do exist (Kohl et al., 2018; 2019), they require ground truth segmentation labels during training. In contrast, we use a simple and fast downstream segmentation pipeline $c(x)$ based on local thresholding and skeletonization (see Appendix A.6 for details) and apply it to individual samples ($s^1 \dots s^K$) predicted by DIVNOISING to derive segmentations ($c^1 \dots c^K$). We explore two label fusion methods to combine the individual results and obtain an improved segmentation. We (i) use Consensus (BIC) (Emre Akbas et al., 2018) and (ii) create a pixel-wise average of ($c^1 \dots c^K$), followed by again applying our threshold based segmentation procedure on this average, calling it Consensus (Avg). For comparison, we also segment (i) the low SNR input images, (ii) the original high SNR images, and (iii) the MMSE solutions of DIVNOISING. Figure 5 and Appendix Fig. 11 show all results of our instance segmentation experiments. It is important to note that segmentation from even a single DIVNOISING prediction outperforms segmentations on the low SNR image data quite substantially. We observe that label fusion methods can, by utilizing multiple samples, outperform the MMSE estimate, with Consensus (Avg) giving the best overall results (see Appendix Fig. 11).
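The Consensus (Avg) fusion described above can be illustrated with a toy sketch. Note the substitution: the pipeline $c(x)$ in this work uses local thresholding and skeletonization, whereas the hypothetical `segment` function below is a plain global threshold, chosen only to keep the example short. Images are flat lists of pixel values, and all names and thresholds are assumptions.

```python
def segment(image, threshold):
    """Simplified stand-in for c(x): binarize with a global threshold."""
    return [1 if value > threshold else 0 for value in image]

def consensus_avg(samples, threshold, vote_threshold=0.5):
    """Segment each sample, average the masks pixel-wise, then threshold again."""
    masks = [segment(s, threshold) for s in samples]
    k = len(masks)
    average = [sum(mask[i] for mask in masks) / k for i in range(len(masks[0]))]
    return segment(average, vote_threshold)

# Three toy denoised samples of the same three-pixel image.
samples = [[0.9, 0.2, 0.6], [0.8, 0.1, 0.4], [0.7, 0.3, 0.7]]
fused = consensus_avg(samples, threshold=0.5)
```

Under this simplification the fusion amounts to a per-pixel majority vote, so a pixel that is foreground in two of the three samples stays foreground in the fused mask.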

7. DISCUSSION AND CONCLUSION

We have introduced DIVNOISING, a novel unsupervised denoising paradigm that allows us, for the first time, to generate diverse and plausible denoising solutions, sampled from a learned posterior. We have demonstrated that the quality of denoised images is highly competitive, typically outperforming the unsupervised state-of-the-art, and at times even improving on supervised results. 1 DIVNOISING uses a lightweight fully convolutional architecture. The success of Deep Image Prior (Ulyanov et al., 2018) shows that convolutional neural networks are inherently suitable for image denoising. Yokota et al. (2019) reinforce this idea and Tachella et al. (2020) additionally hypothesize that a possible reason for the success of convolutional networks is their similarity to non-local patch based filtering techniques. However, the overall performance of DIVNOISING is not merely a consequence of its convolutional architecture. We believe that the novel and explicit modeling of imaging noise in the decoder plays an essential role. This becomes evident when comparing our results to other convolutional baselines (including Deep Image Prior and fully convolutional VAEs), which do not perform as well as DIVNOISING on any of the datasets we used. Additionally, we observe that incorrect noise models consistently lead to inferior results (see Appendix A.7). We find that DIVNOISING is suited particularly well for microscopy data or other applications on a limited image domain. In its current form it works less well on collections of natural images (see Appendix A.10). This might not be very surprising, as we are training a generative model for our image data and would not expect to be capturing the tremendously diverse domain of natural photographic images with the comparatively tiny networks used in our experiments (see Appendix A.3). For microscopy data, instead, the diversity between datasets can be huge. 
Images of the same type of sample, acquired using the same experimental setup, however, contain many resembling structures of lesser overall diversity (they are from a limited image domain). Nevertheless, the stunning results we achieve suggest that DIVNOISING will also find application in other areas where low SNR limited domain image data has to be analyzed. Next to microscopy, we can think of astronomy, medical imaging, or limited domain natural images such as faces or street scenes. Additionally, follow-up research will explore larger and improved network architectures, able to capture more complex DIVNOISING posteriors on datasets covering larger image domains.

While we constrained ourselves to the standard per-pixel noise models in this paper, the DIVNOISING approach could in principle also work with more sophisticated higher-level image degradation models, as long as they can be described probabilistically. This might include diffraction, blur, or even compression and demosaicing artefacts.

Maybe most importantly, DIVNOISING not only produces competitive and diverse results; these results can also be leveraged for downstream processing. We have seen that cell segmentation can be improved and that clustering our results provides us with meaningful alternative interpretations of the same data (see Fig. 4). We believe that this is a highly promising direction for many applications, as it provides us with a way to account for the uncertainty introduced by the imaging process. We are looking forward to seeing how DIVNOISING will be applied and extended by the community, showcasing the true potential and limitations of this approach.

We consistently use 8-fold data augmentation (rotation and flipping) in all experiments. All networks are trained with a batch size of 32 and an initial learning rate of 0.001. The learning rate is multiplied by 0.5 if the validation loss does not decrease for 30 epochs. For all datasets other than MNIST and KMNIST, we extract training patches of size 128 × 128 and set aside 15% of all patches for validation. We set the maximum number of epochs such that approximately 22 million steps are performed, and in each epoch the entire training data is fed to the network. Training is terminated if the validation loss does not decrease by at least $10^{-6}$ over 100 epochs. For DenoiSeg Flywing we observed KL vanishing and solved it via KL annealing within the first 15 epochs (Bowman et al., 2015).

The fully unsupervised DIVNOISING decoder directly predicts the signal and the noise variance per pixel, where the variance is constrained to depend linearly on the signal (see Section 5). To avoid numerical problems and ensure that the predicted variance always remains positive, we allow the user to set a minimum allowed variance/standard deviation $\sigma^2_{\min}$/$\sigma_{\min}$, and enforce this by clamping the predicted values. Note that a viable choice for this parameter depends on the intensity range of the dataset. We use the following values: $\sigma_{\min} = 50$ for all FU-PN2V datasets, $\sigma_{\min} = 3$ for the DenoiSeg Flywing and DenoiSeg Mouse datasets, $\sigma_{\min} = 1$ for the DenoiSeg Mouse s&p dataset, $\sigma_{\min} = 15$ for the BioID Face dataset, $\sigma_{\min} = 25$ for the W2S avg1 datasets and $\sigma_{\min} = 3$ for the W2S avg16 datasets.

Run Time and Hardware Requirements. DIVNOISING, using lightweight fully convolutional networks (see Appendix Figs. 6 and 7), runs on a relatively cheap computational budget. Our depth 2 networks trained for all experiments require about 1.8 GB of GPU memory and our depth 3 networks roughly 5 GB of GPU memory on an NVIDIA TITAN Xp GPU. The training time varied from 5 to 12 hours on average, depending on the dataset.

Figure 7: The fully convolutional architecture used for depth 3 networks. We show the depth 3 DIVNOISING network architecture used for the DenoiSeg Flywing and eBook datasets. These networks count about 700k parameters and have a GPU memory footprint of approximately 5 GB on an NVIDIA TITAN Xp.
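The clamping of the predicted noise variance described in the training details can be illustrated with a minimal sketch. It assumes a per-pixel variance of the form `slope * signal + intercept`; the function name, the parameter names and the example numbers are hypothetical and not taken from the DIVNOISING code.

```python
import numpy as np

def clamped_linear_variance(signal, slope, intercept, sigma_min):
    """Noise variance that depends linearly on the predicted signal,
    clamped from below at sigma_min**2 (all names are illustrative)."""
    variance = slope * signal + intercept
    # Clamping keeps the variance strictly positive and numerically safe.
    return np.maximum(variance, sigma_min ** 2)

# Toy signal values; slope and intercept are made-up example parameters.
signal = np.array([[0.0, 10.0], [100.0, 1000.0]])
variance = clamped_linear_variance(signal, slope=2.0, intercept=-5.0, sigma_min=3.0)
```

In this toy case the raw variance at zero signal would be negative and is replaced by `sigma_min**2`, while larger values pass through unchanged.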

A.4 CLUSTERING OF SOLUTIONS AND DERIVING THE MAP ESTIMATE

Here we provide additional details on how the cluster centers and the approximate MAP estimate of Fig. 4 (see main text) were found. We first drew 10000 sampled images from the approximate posterior as described in Section 4 of the main text. We then performed mean shift (Cheng, 1995) clustering (using the existing scipy implementation) on the cropped image region shown in the figure. We set a bandwidth of 800 and the maximum number of iterations to 20, and used the first 100 DIVNOISING samples as seeds. We finally show 9 of the resulting cluster centers in the figure.

To produce the MAP estimate, we employ a similar strategy. In order to find the mode of the sampled distribution efficiently, we assume that dependencies in the predicted samples are local. This assumption is valid, since our network has only a finite receptive field for each predicted pixel. Hence, we apply the mean shift algorithm on locally overlapping regions. We use a window size of 10 × 10 pixels with an overlap of 3 pixels in x and y. On each such region, the mean shift algorithm is executed repeatedly with decreasing bandwidth, always using the latest result as the new seed. We start by using the sample mean as seed and with an initial bandwidth of 200. After each iteration the bandwidth is multiplied by a factor of 0.9, until it drops below 100. Similar results should also be achievable by applying the mean shift algorithm on the entire image. But since samples will differ at any location in the image, this global approach would require an excessively large number of DIVNOISING samples.
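The annealed mean shift procedure for a single local window can be sketched as follows. This is a minimal illustration, not the implementation used for Fig. 4: it assumes a flat (uniform) kernel with Euclidean distance, handles only one window (the stitching of overlapping windows is omitted), and all function names and the synthetic two-mode data are hypothetical.

```python
import numpy as np

def mean_shift_step(samples, seed, bandwidth):
    """One flat-kernel mean shift update: average all samples that lie
    within `bandwidth` (Euclidean distance) of the current seed."""
    distances = np.linalg.norm(samples - seed, axis=1)
    neighbours = samples[distances <= bandwidth]
    if len(neighbours) == 0:  # no sample in range: keep the current seed
        return seed
    return neighbours.mean(axis=0)

def approximate_mode(samples, start_bandwidth=200.0, stop_bandwidth=100.0, decay=0.9):
    """Repeated mean shift with shrinking bandwidth on one local window.
    samples: array of shape (n_samples, n_pixels), one flattened window each."""
    seed = samples.mean(axis=0)          # start from the sample mean
    bandwidth = start_bandwidth
    while bandwidth >= stop_bandwidth:   # stop once the bandwidth drops below the limit
        seed = mean_shift_step(samples, seed, bandwidth)
        bandwidth *= decay
    return seed

# Synthetic stand-in for posterior samples of a flattened 10 x 10 window:
# 350 samples near mode A and 150 near mode B, which differs in 4 pixels.
rng = np.random.default_rng(0)
mode_a = np.full(100, 50.0)
mode_b = mode_a.copy()
mode_b[:4] += 100.0
samples = np.vstack([
    mode_a + rng.normal(0.0, 5.0, size=(350, 100)),
    mode_b + rng.normal(0.0, 5.0, size=(150, 100)),
])
estimate = approximate_mode(samples)
```

On this toy data the sample mean (the MMSE-like estimate) blends both modes in the four ambiguous pixels, whereas the annealed mean shift result settles near the dominant mode.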

A.5 GENERATING IMAGES WITH DIVNOISING MODELS

Just as with a vanilla VAE (see Section 4 in the main text), we can use a trained DIVNOISING VAE to synthesise images of structures resembling the training data. To achieve this, we sample from the normal distribution, $z^k \sim p(z)$, and process each sample with the decoder network, $s^k = g_\theta(z^k)$. We show such generated images in comparison to real crops from the test data in Appendix Figs. 8 to 10. We see that the images appear most plausible for local structures, indicating that the small networks we use in this work are not capable of capturing larger structural features in the given data.

A.6 INSTANCE CELL SEGMENTATION

Here, we provide additional details regarding the downstream segmentation task described in Section 6 of the main text. We used the first 21 images in the test set of DenoiSeg Flywing for our analysis. Given an input image, our segmentation pipeline consists of (i) generating segmentation masks using local thresholding with a mean filter of radius 15, followed by (ii) skeletonizing the space between these masks, followed by (iii) connected component analysis to obtain an instance segmentation. Using this pipeline, we generated segmentations for the noisy (low SNR) images, the ground truth (high SNR) images, as well as for the DIVNOISING MMSE estimate (obtained by averaging 1000 sampled denoised images). We also apply the pipeline described above to each of the 1000 DIVNOISING samples separately, to serve as input for the two label fusion methods, namely (i) Consensus (BIC) and (ii) Consensus (Avg). For the latter label fusion method we skip the connected component analysis and directly average the thresholded and skeletonized images. To obtain the final result, we again apply the full segmentation pipeline described above to this average image. All segmentations were obtained with the open source image analysis software Fiji (Schindelin et al., 2012). The quantitative results illustrating the benefit of diverse segmentations for label fusion methods are shown in Appendix Fig. 11.

A.7 THE RELATIVE IMPORTANCE OF THE KL LOSS COMPONENT

We can generalize our DIVNOISING training loss as a weighted combination of a modified reconstruction loss (see Section 5 in the main text) and a KL divergence loss; in the DIVNOISING setup, the two loss components are weighted equally. Following the exposition in (Higgins et al., 2017), we explore the effect of weighting the KL loss component during training with a factor β. Our modified training loss thus becomes $\mathcal{L}_{\phi,\theta}(x) = \mathcal{L}^{R}_{\phi,\theta}(x) + \beta \mathcal{L}^{KL}_{\phi}(x)$, where setting β = 1 gives our DIVNOISING setup described in Section 5 in the main text. Note that increasing or reducing β, i.e. changing the relative importance of the reconstruction loss, is equivalent to using a wider or narrower noise model, such as a Gaussian noise model with larger or smaller standard deviation σ. We can thus interpret the results in this section as the effect of using a mismatched noise model that is either too wide or too narrow.

Effect of β on Denoising Quality. We investigated the effect of β on the denoising ability of the DIVNOISING network with the DenoiSeg Flywing dataset. As illustrated in Appendix Fig. 12a, β = 1 gives the optimal results for the MMSE estimate (obtained by averaging 1000 samples). Both regimes, β > 1 and β < 1, yield sub-par denoising performance.

Effect of β on Diversity of Denoised Samples. We introduce a simple new metric, called standard deviation PSNR, to quantify the diversity of denoised results obtained as a function of β. For a given noisy image $x$ and a set of denoised samples $S_x$, we compute the PSNR of each sample $a_i \in S_x$ with respect to the corresponding ground truth image $s$. This yields a vector of PSNR values $v$ with entries $v_i = \mathrm{PSNR}(s, a_i)$. The standard deviation PSNR for the noisy image $x$ is then defined as the standard deviation of the elements of $v$. Appendix Fig. 12b reports the average standard deviation PSNR obtained for the 42 test images of the DenoiSeg Flywing dataset. The higher β, the higher the standard deviation PSNR, indicating higher diversity.
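The standard deviation PSNR metric can be written down in a few lines. The sketch below is only illustrative: the PSNR convention (peak given by a user-supplied `data_range`) and the use of the population standard deviation are assumptions, and the function names and toy images are hypothetical.

```python
import numpy as np

def psnr(ground_truth, prediction, data_range):
    """Peak signal-to-noise ratio in dB for a given intensity range."""
    mse = np.mean((ground_truth - prediction) ** 2)
    return 10.0 * np.log10(data_range ** 2 / mse)

def std_psnr(ground_truth, samples, data_range):
    """Standard deviation of the PSNR values of all denoised samples
    with respect to the same ground truth image."""
    values = np.array([psnr(ground_truth, a, data_range) for a in samples])
    return values.std()

# Toy example: two "samples" with mean squared errors of 1 and 100.
ground_truth = np.zeros((4, 4))
samples = [ground_truth + 1.0, ground_truth + 10.0]
diversity = std_psnr(ground_truth, samples, data_range=255.0)
```

For a whole test set, this value would be computed per noisy image and then averaged over all images, as described for Appendix Fig. 12b.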
Qualitative results presented in Appendix Fig. 13 show that with β > 1, there is increased diversity at larger image scales (e.g. diverse predictions of cell membranes), and generated denoised images appear smoother than those observed in real data. Setting β < 1 reduces diversity and introduces grainy artefacts, thereby yielding poor reconstructions. Note that β = 1 gives the best results in terms of PSNR of the MMSE estimate while maintaining a fair level of diversity.

A.8 HOW DOES NOISE AFFECT THE DIVERSITY OF DIVNOISING SAMPLES?

We quantified how the diversity of DIVNOISING samples changes with the amount of noise present in the original dataset. An increased level of noise introduces additional uncertainty about the true signal, hence we would expect this to lead to increasingly diverse samples. To test this hypothesis, we choose the DenoiSeg Flywing dataset and inject pixel-wise independent Gaussian noise of mean 0 and standard deviations σ = 30, 50 and 70. We report the standard deviation PSNR diversity metric, introduced in Appendix Section A.7, for all three noise levels. As demonstrated in Appendix Fig. 12c, the higher the noise level, the more diverse the DIVNOISING samples become, thereby confirming our hypothesis.

We investigated the denoising performance of the DIVNOISING network on the natural images benchmark dataset BSD68 (Roth & Black, 2005) and show our results in Appendix Fig. 26, where the input has been corrupted with Gaussian noise of σ = 25. With our depth 2 network having 96 feature channels in the first network layer, we achieve a PSNR of 27.45 dB while our unsupervised NOISE2VOID baseline gives 27.71 dB. As discussed in the main text, this does not come as a surprise since our DIVNOISING network is comparatively small and is asked to learn a complete generative model of the entire data domain (see main text and Appendix Figs. 8-10). Learning such a model for the tremendous diversity present in natural images is challenging, and likely the reason why other architectures solving problems posed on the domain of natural images are much larger than our networks are.
Future versions of DIVNOISING will address this issue by using more expressive architectures. However, DIVNOISING already gives us access to clean samples from the true (data) posterior (see Appendix Fig. 26).

Following the idea introduced in (Kingma & Welling, 2014), we overcome this problem by instead using an encoder to describe an auxiliary distribution $q_\phi(z|x)$. The encoder can take a noisy image $x$ and yield a distribution over $z$ values, which in turn are likely to produce $x$ under the generative model. We want the encoder distribution to approximate the true underlying distribution, $q_\phi(z|x) \approx p_\theta(z|x)$, as it is implicitly described by our graphical model. From Bayes' theorem, $p_\theta(z|x)$ factorizes as

$$p_\theta(z|x) = \frac{p_\theta(x|z)\,p(z)}{p_\theta(x)}. \tag{10}$$

The decoder in the DIVNOISING setup is a deterministic function of $z$, i.e., $g_\theta(z) = s$. Hence, we can reformulate Supp. Eq. 10 as

$$p_\theta(z|x) = \frac{p_{\mathrm{NM}}(x|s = g_\theta(z))\,p(z)}{p_\theta(x)}. \tag{11}$$

We can describe the quality of the encoder distribution, i.e. how well it approximates the true $p_\theta(z|x)$, via the KL divergence

$$\mathrm{KL}\left(q_\phi(z|x)\,\|\,p_\theta(z|x)\right) = -\int q_\phi(z|x) \log \frac{p_\theta(z|x)}{q_\phi(z|x)}\,dz. \tag{12}$$

Substituting Supp. Eq. 11 in Supp. Eq. 12, we get

$$\begin{aligned}
\mathrm{KL}\left(q_\phi(z|x)\,\|\,p_\theta(z|x)\right)
&= -\int q_\phi(z|x) \log \frac{p_{\mathrm{NM}}(x|s = g_\theta(z))\,p(z)}{p_\theta(x)\,q_\phi(z|x)}\,dz \\
&= -\int q_\phi(z|x) \left[ \log \frac{p_{\mathrm{NM}}(x|s = g_\theta(z))\,p(z)}{q_\phi(z|x)} - \log p_\theta(x) \right] dz \\
&= -\int q_\phi(z|x) \log \frac{p_{\mathrm{NM}}(x|s = g_\theta(z))\,p(z)}{q_\phi(z|x)}\,dz + \int q_\phi(z|x) \log p_\theta(x)\,dz \\
&= -\int q_\phi(z|x) \log \frac{p_{\mathrm{NM}}(x|s = g_\theta(z))\,p(z)}{q_\phi(z|x)}\,dz + \log p_\theta(x) \int q_\phi(z|x)\,dz.
\end{aligned}$$

Since $\int q_\phi(z|x)\,dz = 1$, we get

$$\mathrm{KL}\left(q_\phi(z|x)\,\|\,p_\theta(z|x)\right) = -\int q_\phi(z|x) \log \frac{p_{\mathrm{NM}}(x|s = g_\theta(z))\,p(z)}{q_\phi(z|x)}\,dz + \log p_\theta(x).$$

This implies

$$\begin{aligned}
\log p_\theta(x) &= \int q_\phi(z|x) \log \frac{p_{\mathrm{NM}}(x|s = g_\theta(z))\,p(z)}{q_\phi(z|x)}\,dz + \mathrm{KL}\left(q_\phi(z|x)\,\|\,p_\theta(z|x)\right) \\
&= \mathrm{ELBO} + \mathrm{KL}\left(q_\phi(z|x)\,\|\,p_\theta(z|x)\right),
\end{aligned} \tag{13}$$

where ELBO is the Evidence Lower Bound as also introduced in (Kingma & Welling, 2019) in the context of standard VAEs, and here $\mathrm{ELBO} = \int q_\phi(z|x) \log \frac{p_{\mathrm{NM}}(x|s = g_\theta(z))\,p(z)}{q_\phi(z|x)}\,dz$.

Note that the KL divergence term in Supp. Eq. 13 is always greater than or equal to 0 and hence, ELBO is a lower bound for $\log p_\theta(x)$, i.e., $\log p_\theta(x) \geq \mathrm{ELBO}$. It follows from Supp. Eq. 13 that

$$\mathrm{ELBO} = \log p_\theta(x) - \mathrm{KL}\left(q_\phi(z|x)\,\|\,p_\theta(z|x)\right). \tag{14}$$

Supp. Eq. 14 implies that maximizing ELBO with respect to $\phi$ and $\theta$ maximizes $\log p_\theta(x)$ and minimizes $\mathrm{KL}\left(q_\phi(z|x)\,\|\,p_\theta(z|x)\right)$, the goals we seek to achieve. Hence,

$$\begin{aligned}
\max \mathrm{ELBO}
&= \max \left( \int q_\phi(z|x) \log \frac{p_{\mathrm{NM}}(x|s = g_\theta(z))\,p(z)}{q_\phi(z|x)}\,dz \right) \\
&= \max \left( \int q_\phi(z|x) \log p_{\mathrm{NM}}(x|s = g_\theta(z))\,dz + \int q_\phi(z|x) \log \frac{p(z)}{q_\phi(z|x)}\,dz \right) \\
&= \max \left( \int q_\phi(z|x) \log p_{\mathrm{NM}}(x|s = g_\theta(z))\,dz - \mathrm{KL}\left(q_\phi(z|x)\,\|\,p(z)\right) \right) \\
&= \max \left( \mathbb{E}_{q_\phi(z|x)}\left[ \log p_{\mathrm{NM}}(x|s = g_\theta(z)) \right] - \mathrm{KL}\left(q_\phi(z|x)\,\|\,p(z)\right) \right).
\end{aligned}$$

Maximizing the ELBO is equivalent to minimizing the negative ELBO, thus giving us the DIVNOISING loss function

$$\mathcal{L}_{\phi,\theta}(x) = \min \left( \mathbb{E}_{q_\phi(z|x)}\left[ -\log p_{\mathrm{NM}}(x|s = g_\theta(z)) \right] + \mathrm{KL}\left(q_\phi(z|x)\,\|\,p(z)\right) \right), \tag{15}$$

where the expected value is approximated in each iteration by drawing a single sample from $q_\phi(z|x)$. Note that the first term in the summation in Supp. Eq. 15 is the same as described in Section 5 in the main text, whereas the second term in the summation is the same as used in the standard VAE loss.
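To make the single-sample approximation of the loss in Supp. Eq. 15 concrete, the following NumPy sketch evaluates both terms for one image. It is only an illustration under stated assumptions: the noise model is taken to be Gaussian with a fixed standard deviation, the encoder distribution is assumed to be a diagonal Gaussian given by `mu` and `log_var`, and the linear toy decoder, the function names and all numbers are hypothetical stand-ins for the trained networks.

```python
import numpy as np

def gaussian_noise_nll(x, s, sigma):
    """-log p_NM(x | s), summed over pixels, for a Gaussian noise model
    with fixed standard deviation sigma (an assumed, simple noise model)."""
    return np.sum(0.5 * np.log(2.0 * np.pi * sigma ** 2)
                  + (x - s) ** 2 / (2.0 * sigma ** 2))

def kl_to_standard_normal(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ), summed over latent dims."""
    return 0.5 * np.sum(np.exp(log_var) + mu ** 2 - 1.0 - log_var)

def divnoising_loss(x, mu, log_var, decoder, sigma, rng):
    """Single-sample estimate of the negative ELBO for one image x."""
    eps = rng.standard_normal(mu.shape)
    z = mu + np.exp(0.5 * log_var) * eps   # one sample z ~ q(z|x)
    s = decoder(z)                          # predicted clean signal s = g(z)
    return gaussian_noise_nll(x, s, sigma) + kl_to_standard_normal(mu, log_var)

# Toy setup: 2 latent dimensions, 4 "pixels", a fixed linear map as decoder.
rng = np.random.default_rng(0)
weights = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -0.5]])
toy_decoder = lambda z: weights @ z
x = np.array([0.2, -0.1, 0.0, 0.3])
loss = divnoising_loss(x, mu=np.zeros(2), log_var=np.zeros(2),
                       decoder=toy_decoder, sigma=1.0, rng=rng)
```

In a real training loop, `mu` and `log_var` would come from the encoder applied to `x`, and the gradient of this quantity with respect to the encoder and decoder parameters would drive the optimisation.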

A.13 COMPARISON OF PREDICTED VARIANCES BY VARIOUS METHODS

Unsupervised DIVNOISING and vanilla VAEs are both trained fully unsupervised, learning to predict per-pixel noise models. Learning a good noise model is essential for good denoising performance, as is evident from Table 1. Here, we compare the noise models and variance maps predicted for two datasets by our unsupervised DIVNOISING and vanilla VAEs. BioID Face dataset. This dataset has been synthetically corrupted with Gaussian noise of µ = 0 and σ = 15.



¹ Supervised methods using perfect GT will outperform DIVNOISING, but GT data is at times not perfect.



Figure 1: Training and prediction/inference with DIVNOISING. (top) A DIVNOISING VAE can be trained fully unsupervised, using only noisy data and a (measured, bootstrapped, or co-learned) pixel noise model $p_{\mathrm{NM}}(x_i|s_i)$ (see main text for details). (bottom) After training, the encoder can be used to sample multiple $z^k \sim q_\phi(z|x)$, giving rise to diverse denoised samples $s^k$. These samples can further be used to infer consensus point estimates such as an MMSE or a MAP solution.

Figure 2: Qualitative denoising results. We compare two DIVNOISING samples, the MMSE estimate (derived by averaging 1000 sampled images), and results by the supervised CARE baseline. The diversity between individual samples is visualized in the column of difference images. (See Appendix A.9 for additional images of DIVNOISING results.)

Figure 3: Sensibility of Noise Models. For each predicted signal intensity (x-axis), we show the variance of noisy observations (y-axis). The plot is generated from experiments on the Convallaria dataset. The dashed red line shows the true noise distribution (measured from pairs of noisy and clean calibration data). This true distribution, as well as the noise model created via bootstrapping and the noise model we co-learned with DIVNOISING, show simple (approximately) linear relationships between signal intensities and noise variance. Such a relationship is known to coincide with the physical reality of Poisson noise (shot noise) (Zhang et al., 2019). The implicitly learned noise model of the vanilla VAE has to independently predict the noise variance for each pixel. Its predictions clearly deviate from the true linear relationship. See Appendix A.13 for results on the BioID Face dataset and more details.

Figure 4: Exploring the learned posterior. The MMSE estimate (average of 10k samples) shows faintly overlaid letters as a consequence of ambiguities in the noisy input. Among these samples from the posterior, we use mean shift clustering (on smaller crops) to identify diverse and likely points in the posterior. We show 9 such cluster centers in no particular order. We also obtain an approximate MAP estimate (see Supplementary Material), which has most artifacts of the MMSE solution removed.

Figure 5: DIVNOISING enables downstream segmentation. We show input images (upper row) and results of a fixed (untrained) segmentation pipeline (lower row). Cells that were segmented incorrectly (merged or split) are indicated in magenta. While segmentations of the noisy raw data are of very poor quality, sampled DIVNOISING results give rise to much better and diverse solutions (cols. 2-5). We then use two label fusion methods to find consensus segmentations (col. 6), which are even outperforming segmentation results on high SNR (GT) images. Quantitative results are presented in Appendix Fig. 11.

Figure 8: Generating synthetic images with the DIVNOISING VAE for the FU-PN2V Convallaria dataset (Krull et al., 2020; Prakash et al., 2020). DIVNOISING can also be used to generate images by sampling from the unit normal distribution $p(z)$ and then using the decoder to produce an image. Here, we compare generated images and randomly cropped real images. We show images of different resolutions to see how well the VAE captures structures at different scales. The VAE we use for denoising is only able to realistically capture small local structures. Note that the network we use is quite shallow (see Appendix Fig. 6).

Figure 9: Generating synthetic images with the DIVNOISING VAE for the DenoiSeg Flywing (Buchholz et al., 2020) dataset. DIVNOISING can also be used to generate images by sampling from the unit normal distribution $p(z)$ and then using the decoder to produce an image. Here, we compare generated images and randomly cropped real images. We show images of different resolutions to see how well the VAE captures structures at different scales. Note that the network we use (see Appendix Fig. 7) is a bit deeper than the one used for Supplementary Fig. 8. This VAE captures larger structures a little better but struggles to produce crisp high-frequency structures. This is likely a consequence of the increased depth of the network used.

Figure 10: Generating synthetic images with the DIVNOISING VAE for the MNIST (LeCun et al., 1998) dataset. DIVNOISING can also be used to generate images by sampling from the unit normal distribution $p(z)$ and then using the decoder to produce an image. Here, we compare generated images and random ground truth images. Our fully convolutional architecture allows us to generate images of different sizes (despite all input images being only of size 28 × 28).

Figure 11: DIVNOISING enables downstream segmentation. Evaluation of segmentation results using the F1 score (Van Rijsbergen, 1979), Jaccard score (Jaccard, 1901) and Average Precision (Lin et al., 2014). On the x-axis we plot the number of DIVNOISING samples used. The performance of BIC is only evaluated up to 100 samples because we limited run-time to 30 minutes. Remarkably, Consensus (Avg), using only 30 DIVNOISING segmentation labels, outperforms segmentations obtained from high SNR images.

Figure 12: Analyzing the denoising quality and diversity of DIVNOISING samples with different factors for the DenoiSeg Flywing dataset. (a) The heatmap shows how the quality (PSNR in dB) of the DIVNOISING MMSE estimate changes when averaging an increasingly large number of samples (numbers shown for 1 run). Unsurprisingly, the more samples are averaged, the better the results get. We also investigate the effect of weighting the KL loss term with a factor β (Supplementary Eq. 5) on the quality of reconstruction. We observe that the usual VAE setup with β = 1 gives the best results in terms of reconstruction quality. Increasing β > 1 leads to higher diversity at the expense of poor reconstruction (see Appendix Fig. 13). (b) We quantify the denoising diversity achieved with different β values in terms of standard deviation PSNR (see Appendix Section A.7 for details on the metric). We report the average standard deviation of PSNRs over all test images for different values of β and observe that higher β values increase the diversity. (c) We also investigate the effect of noise on the diversity of denoised DIVNOISING samples by adding pixel-wise independent zero-mean Gaussian noise of standard deviations 30, 50 and 70. The higher the noise, the more ambiguous the noisy input images are, thus leading to higher diversity.

Figure 13: Qualitative analysis of the effect of weighting the KL loss term with factor β for the DenoiSeg Flywing dataset. (a) We show the DIVNOISING MMSE estimate obtained by averaging 1000 samples for all considered β values (Supplementary Eq. 5). We observe that the reconstruction quality suffers on either increasing β > 1 or decreasing β < 1. Best results (with respect to PSNR) are obtained with β = 1, as demonstrated in Fig. 12a. (b) For each β value, we show three randomly chosen DIVNOISING samples as well as difference images. Increasing β > 1 allows the DIVNOISING network to generate structurally very diverse denoised solutions, while typically leading to textural smoothing. Decreasing β < 1 generates DIVNOISING samples with overall much reduced structural diversity, introducing reconstruction artefacts/structures at smaller scales.

Figure 15: Additional qualitative results for the DenoiSeg Flywing (Buchholz et al., 2020) dataset. Here, we show qualitative results for two cropped regions (green and cyan). The MMSE estimate was produced by averaging 1000 sampled images. We choose 3 samples to display to illustrate the diversity of DIVNOISING results.

Figure 19: Additional qualitative results for the W2S (Zhou et al., 2020) dataset (ch. 0, avg1). Here, we show qualitative results for two cropped regions (green and cyan). The MMSE estimate was produced by averaging 1000 sampled images. We choose 3 samples to display to illustrate the diversity of DIVNOISING results.

Figure 21: Additional qualitative results for the W2S (Zhou et al., 2020) dataset (ch. 2, avg1). Here, we show qualitative results for two cropped regions (green and cyan). The MMSE estimate was produced by averaging 1000 sampled images. We choose 3 samples to display to illustrate the diversity of DIVNOISING results.

Figure 22: Additional qualitative results for the MNIST (LeCun et al., 1998) dataset. Here, we show qualitative results for two cropped regions (green and cyan). The MMSE estimate was produced by averaging 1000 sampled images. We choose 3 samples to display to illustrate the diversity of DIVNOISING results.

Figure 23: Additional qualitative results for the KMNIST (Clanuwat et al., 2018) dataset. Here, we show qualitative results for two cropped regions (green and cyan). The MMSE estimate was produced by averaging 1000 sampled images. We choose 3 samples to display to illustrate the diversity of DIVNOISING results.

Figure 25: Additional qualitative results for the BioID Face (noa) dataset. Here, we show qualitative results for two cropped regions (green and cyan). The MMSE estimate was produced by averaging 1000 sampled images. We choose 3 samples to display to illustrate the diversity of DIVNOISING results.

Figure 28: Comparison of noise models and variance maps predicted by the vanilla VAE and DIVNOISING. (a) For each predicted signal intensity (x-axis), we show the variance of noisy observations (y-axis). The plot is generated from experiments on the BioID Face dataset. The dashed red line shows the true noise distribution (Gaussian noise with $\sigma^2 = 225$). The noise model created via bootstrapping and the noise model we co-learned with DIVNOISING correctly show (approximately) constant values across all signal intensities. The implicitly learned noise model of the vanilla VAE has to independently predict the noise variance for each pixel. Its predictions clearly deviate from the true constant noise variance. (b) We visually compare the denoising results and show how the predicted variance varies across the image. While the variance predicted by the implicitly co-learned vanilla VAE model varies depending on the image content, the variance predicted by the co-learned DIVNOISING model correctly remains flat.

Mouse s&p 32.98±0.020 23.62±0.084 35.19±0.030 29.67±0.079 36.21±0.015 37.03±0.016
BioID Face 32.34±0.080 32.58±0.022 33.02±0.020 33.76±0.079 33.12±0.039 35.06±0.051

Quantitative results.

Table 3 in Appendix A.15 for precise sampling times). The effect of averaging a different number of samples is explored in Appendix A.8.

Comparison of Self2Self with DIVNOISING. We train Self2Self (S2S) on a random single image per dataset and compare it with DIVNOISING trained on the same single image (DivN. 1) and DIVNOISING trained with the full dataset (DivN. all). All methods are tested on the selected single image. The overall best method is indicated in bold. For all datasets, DIVNOISING leads to the best performance while being orders of magnitude faster and needing significantly less GPU memory.

A.15 SAMPLING TIME DURING PREDICTION

During prediction, in order to obtain diverse results, or to compute the MMSE or MAP estimates, we need to sample multiple denoised images from the trained DIVNOISING posterior. Table 3 reports the time (in seconds) needed for sampling 1000 denoised images. For all datasets, sampling 1000 denoised images requires less than 7 seconds.

Sampling times with DIVNOISING. Average time (± SD) needed to sample 1000 denoised images from a trained DIVNOISING network (evaluated over all test images of the respective dataset).


A.9 ADDITIONAL RESULTS

More Qualitative Results. In addition to the qualitative results presented in Fig. 2 in the main text, here we present more results for each considered dataset in Appendix Figs. A.9-24.

[Figure panels: Input, MMSE, Ground truth, Sample 1, Sample 2, Sample 3, and the difference images Sample 1 - Sample 2, Sample 1 - Sample 3, Sample 2 - Sample 3.]

Published as a conference paper at ICLR 2021

A APPENDIX

A.1 INTRINSICALLY NOISY MICROSCOPY DATA

We use public microscopy datasets which show realistic levels of noise, introduced by the respective optical imaging setups. The FU-PN2V Convallaria (Krull et al., 2020; Prakash et al., 2020) data consists of 100 noisy calibration images (intended to generate a noise model) and 100 images of size 1024 × 1024 showing a noisy Convallaria section. The FU-PN2V Mouse nuclei (Prakash et al., 2020) data is composed of 500 noisy calibration images and 200 noisy images of size 512 × 512 showing labeled cell nuclei. The FU-PN2V Mouse actin (Prakash et al., 2020) data from the same source consists of 100 noisy calibration images and 100 noisy images of size 1024 × 1024 of the same sample, but labeled for the protein actin. Finally, we use all 3 channels of 2 noise levels (avg1 and avg16) of the W2S (Zhou et al., 2020) data. For each channel, corresponding high quality (ground truth) images are available. Each channel's training and test sets consist of 80 and 40 images, respectively. All images are 512 × 512 pixels in size.

A.2 DATA EXPOSED TO SYNTHETIC NOISE

We use the well known MNIST (LeCun et al., 1998) as well as the KMNIST (Clanuwat et al., 2018) dataset, showing 28 × 28 images of handwritten digits and phonetic letters of hiragana, respectively. Both datasets contain 60000 training examples and 10000 test examples. To both datasets we added pixel-wise independent Gaussian noise with µ = 0 and σ = 140. As a third text-based dataset, we rendered the freely available eBook "The Beetle" (Marsh, 2004) and extracted 40800 image patches of size 128 × 128. We separated 34680 patches for training and 6120 patches for validation, and added pixel-wise independent Gaussian noise with µ = 0 and σ = 255. Additionally, we use three datasets from microscopy. The DenoiSeg Mouse (Buchholz et al., 2020) data, showing cell nuclei in the developing mouse skull, consists of 908 training and 160 validation images of size 128 × 128, with an additional 67 images of size 256 × 256 for testing. Two noisy datasets were created from this data: one by exposing all images to pixel-wise independent Gaussian noise with µ = 0 and σ = 20, and another by first applying Poisson noise with λ = 1, then adding Gaussian noise with µ = 0 and σ = 10, and finally randomly changing 3% of pixels to either 0 or 255. The latter dataset is called Mouse s&p in Table 1. The DenoiSeg Flywing (Buchholz et al., 2020) data shows membrane-labeled cells in a fly wing and consists of 1428 training and 252 validation patches of size 128 × 128, with an additional 42 images of size 512 × 512 for testing. We exposed this data to pixel-wise independent Gaussian noise with µ = 0 and σ = 70 to create a synthetic low SNR version. All original datasets are 8-bit. Lastly, we randomly select 500 images of size 384 × 286 from the BioID Face recognition database (noa) and corrupt them with pixel-wise independent Gaussian noise with µ = 0 and σ = 15. We use 340, 60 and 100 images for training, validation and test, respectively.
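The corruption steps described above can be illustrated with a minimal NumPy sketch. The function names `add_gaussian_noise` and `add_mixed_noise` are hypothetical and not taken from the original code. Two points are assumptions: the noisy values are left unclipped as floating point numbers, and λ is interpreted as a factor that scales the clean intensity before Poisson sampling, since the text does not state how λ enters.

```python
import numpy as np


def add_gaussian_noise(image, sigma, rng):
    """Add pixel-wise independent zero-mean Gaussian noise (no clipping)."""
    return image.astype(np.float64) + rng.normal(0.0, sigma, size=image.shape)


def add_mixed_noise(image, rng, lam=1.0, sigma=10.0, fraction=0.03):
    """Poisson noise, then Gaussian noise, then salt-and-pepper corruption."""
    # Assumption: lam scales the clean intensity that serves as the Poisson rate.
    noisy = rng.poisson(lam * image).astype(np.float64)
    noisy = noisy + rng.normal(0.0, sigma, size=image.shape)
    # Set a randomly chosen fraction of the pixels to either 0 or 255.
    flat = noisy.reshape(-1).copy()
    n_corrupt = int(round(fraction * flat.size))
    idx = rng.choice(flat.size, size=n_corrupt, replace=False)
    flat[idx] = rng.choice(np.array([0.0, 255.0]), size=n_corrupt)
    return flat.reshape(image.shape)


rng = np.random.default_rng(0)
clean = np.full((128, 128), 100, dtype=np.uint8)  # stand-in for an 8-bit image
noisy_gauss = add_gaussian_noise(clean, sigma=20.0, rng=rng)
noisy_sp = add_mixed_noise(clean, rng)
```

Passing an explicit random generator keeps the corruption reproducible, which matters when the same noisy images are reused for training and evaluation.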

A.3 TRAINING AND NETWORK DETAILS

Here, we provide additional details about the network architecture and training parameters used throughout the main manuscript. For all DIVNOISING experiments, we use rather lightweight depth 2 and depth 3 VAE architectures (see Appendix Figs. 6 and 7, respectively). All networks use a single input channel and 32 feature channels in the first network layer, except for the network trained on the Mouse s&p dataset, which uses 96 feature channels in the first network layer. We use two 3 × 3 convolutions (with padding 1), each followed by a ReLU activation, followed by a 2 × 2 max pooling layer. After each such downsampling step, we double the number of feature channels. For all experiments except those on the DenoiSeg Flywing and eBook data, we use a network architecture of depth 2 (with 2 down/upsampling steps). For these two datasets we use a depth 3 architecture (with 3 down/upsampling steps). In total, our depth 2 networks have only around 200k parameters and our depth 3 networks have around 700k parameters. While we generally use a VAE bottleneck of 64 latent space feature dimensions for each pixel of the image (after encoding), for the small 28 × 28 MNIST and KMNIST images we use only 8 such latent space dimensions.

Our relatively small DIVNOISING networks fail to capture the ample structural diversity present in natural photographic images, thereby exhibiting sub-par performance. However, diversity at adequately small image scales (with respect to the used network's capabilities) can still be observed, as demonstrated with the different samples and the difference images corresponding to the green and cyan insets. We are confident that future work on DIVNOISING with larger networks and different network architectures/training schedules will expand the capabilities of this method to capture more complex image domains.
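The encoder layout described in Section A.3 (two padded 3 × 3 convolutions per level, 2 × 2 max pooling, channel doubling after each downsampling step) can be traced with a small bookkeeping sketch. The helper `encoder_layout` is hypothetical, and how exactly the bottleneck layers map features to the latent dimensions is an assumption here, since those details are only given in the architecture figures.

```python
def encoder_layout(image_size, depth=2, first_channels=32, latent_dims=64):
    """Return the (height, width, channels) per encoder level and the latent shape."""
    height, width = image_size
    channels = first_channels
    levels = []
    for _ in range(depth):
        # Two 3x3 convolutions with padding 1 keep the spatial size unchanged.
        levels.append((height, width, channels))
        # A 2x2 max pooling halves each spatial dimension.
        height, width = height // 2, width // 2
        # The number of feature channels doubles after each downsampling step.
        channels *= 2
    # Assumption: the latent code has `latent_dims` features per encoded pixel.
    latent_shape = (height, width, latent_dims)
    return levels, latent_shape


# Depth 2 network on 28 x 28 MNIST/KMNIST images with 8 latent dimensions.
mnist_levels, mnist_latent = encoder_layout((28, 28), depth=2, latent_dims=8)
# Depth 3 network on 128 x 128 patches with the default 64 latent dimensions.
patch_levels, patch_latent = encoder_layout((128, 128), depth=3)
```

Under these assumptions, a 28 × 28 input to the depth 2 network is encoded to a 7 × 7 grid with 8 latent dimensions per position, which illustrates why the latent code is called a downsampled bottleneck.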

A.11 HOW ACCURATE IS THE DIVNOISING MODEL AND THE APPROXIMATE POSTERIOR?

Upon close inspection, we find that the images sampled by DIVNOISING exhibit various imperfections, making clear that they are in fact only samples from an approximate posterior.

For example, we find that DIVNOISING samples are often smoother than real images, see e.g. Appendix Figs. 15 and 22. We attribute this problem to our network architecture (see also Appendix Section A.9). For instance, a U-NET based supervised denoiser can make use of skip connections to propagate high frequency information, whereas DIVNOISING VAEs have to pipe all information through the downsampled latent variable bottleneck.

Another common artefact in sampled images is the presence of faint overlaid structures in the background (see Suppl. Fig. 24). Note that this artefact is less pronounced than in the MMSE estimate (where we expect such artefacts).

We believe that most of these remaining issues will be solved/reduced by using more sophisticated network architectures and refined training schedules.

A.12 DERIVATION OF DIVNOISING LOSS FUNCTION FROM PROBABILITY MODEL PERSPECTIVE

Here, we want to provide a more formal derivation of why our loss function can be used to train the VAE as desired. We follow a similar line of argument as has been laid out for the standard VAE by Doersch (2016).

In our framework, we assume that the observed data $x$ is generated from some underlying latent variable $z$ through some clean signal $s$ via a known noise model $p_{\mathrm{NM}}(x \mid s)$. This process of data generation is depicted as a graphical model in Appendix Fig. 27. The decoder describes a full joint model for all three variables (Supp. Eq. 6). In the assumed graphical model (Appendix Fig. 27), $x$ is conditionally independent of $z$ given $s$. Formally, this implies that $p(x \mid s, z) = p_{\mathrm{NM}}(x \mid s)$ (Supp. Eq. 7). Using Supp. Eq. 7, we can reformulate Supp. Eq. 6 in terms of the noise model (Supp. Eq. 8).

To train the generative model from Appendix Fig. 27, we try to adjust the parameters $\theta$ to maximize the likelihood of observing our training data $x$.
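For reference, the factorization described in this derivation can be written out as follows. This is a sketch under stated assumptions rather than a verbatim copy of the original equations: it assumes the chain-rule ordering $z \to s \to x$ of the graphical model, a deterministic decoder $s = g_\theta(z)$ (so that $p_\theta(s \mid z)$ is a point mass), and it assigns equation numbers to match the references in the text.

```latex
\begin{align}
p_\theta(z, x, s) &= p(x \mid s, z)\, p_\theta(s \mid z)\, p(z) \tag{6}\\
p(x \mid s, z) &= p_{\mathrm{NM}}(x \mid s) \tag{7}\\
p_\theta(z, x, s) &= p_{\mathrm{NM}}(x \mid s)\, p_\theta(s \mid z)\, p(z) \tag{8}\\
p_\theta(x) &= \int p_{\mathrm{NM}}\bigl(x \mid s = g_\theta(z)\bigr)\, p(z)\, \mathrm{d}z \tag{9}
\end{align}
```

The last line is obtained by marginalizing the joint model over $s$ and $z$. With a deterministic decoder, the integral over $s$ collapses to the single value $g_\theta(z)$, leaving only the integral over the latent variable.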
This means that we need to maximize the marginal likelihood $p_\theta(x)$, which is an integral over the latent variable $z$ (Supp. Eq. 9). However, computing the integral in Supp. Eq. 9 is intractable due to the high dimensionality of $z$. In our particular model, we would need to integrate over 64 dimensions for each pixel for all our datasets except MNIST and KMNIST, where we would need to integrate over 8 dimensions for each pixel. An alternative to computing the integral would be to approximate it by sampling a large number of values $z^1, z^2, \dots, z^K$ from $p(z)$ and computing $p_\theta(x) \approx \frac{1}{K} \sum_{k=1}^{K} p_{\mathrm{NM}}(x \mid s = g_\theta(z^k))$. However, since $p_{\mathrm{NM}}(x \mid s = g_\theta(z^k))$ will be very close to 0 for almost all $z^k$, this would require $K$ to be a very large number for each image in our training set.

Convallaria dataset. This dataset is intrinsically noisy, and the noise distribution resembles the shot noise and read-out noise characteristics typical for images acquired under low-light settings.

Since SELF2SELF is trained per image, leading to prohibitive computation times on our test sets, we randomly chose single images from four of our datasets (FU-PN2V Convallaria, FU-PN2V Mouse actin, FU-PN2V Mouse nuclei and W2S Ch.1 (avg1)), all of which contain real-world noise.

We compare the performance of SELF2SELF trained on single images with the performance of DIVNOISING when trained on (i) the same single image as SELF2SELF, and (ii) the entire body of available noisy data in the respective dataset. All trained networks are then applied to the selected single images. Note that SELF2SELF is run with its default settings.

Since SELF2SELF training is computationally expensive even for a single image, we decided to limit training time to 10 hours per input on an NVIDIA TITAN Xp GPU. We monitored its performance by periodically computing the PSNR (every 3000 training steps), which showed that even after 10 hours SELF2SELF had not yet fully converged. Table 2 shows all results we obtained. It can be seen that DIVNOISING, when trained on the full dataset, consistently leads to better performance, while DIVNOISING trained on single images leads to comparable results in a fraction of the training time and uses significantly less GPU memory.
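The PSNR used to monitor training can be computed as in the following minimal sketch. The function name `psnr` and the `data_range` argument are hypothetical. Which intensity range enters the formula (for example 255 for 8-bit data, or the range of the ground truth image) is an assumption that must be fixed consistently across all compared methods.

```python
import numpy as np


def psnr(ground_truth, prediction, data_range):
    """Peak signal-to-noise ratio in dB for a given intensity range."""
    gt = np.asarray(ground_truth, dtype=np.float64)
    pred = np.asarray(prediction, dtype=np.float64)
    mse = np.mean((gt - pred) ** 2)
    if mse == 0:
        # Identical images: the PSNR is unbounded.
        return float("inf")
    return float(10.0 * np.log10(data_range ** 2 / mse))


# Toy example: every pixel is off by exactly one intensity level.
reference = np.zeros((4, 4))
estimate = np.ones((4, 4))
example_psnr = psnr(reference, estimate, data_range=255.0)
```

In a monitoring setup such as the one described above, a function like this would be called on the current denoised output at fixed intervals of training steps.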

