SCALING LAWS FOR DEEP LEARNING BASED IMAGE RECONSTRUCTION

Abstract

Deep neural networks trained end-to-end to map a measurement of a (noisy) image to a clean image perform excellently on a variety of linear inverse problems. Current methods are trained on only a few hundred or thousand images, as opposed to the millions of examples deep networks are trained on in other domains. In this work, we study whether major performance gains can be expected from scaling up the training set size. We consider image denoising, accelerated magnetic resonance imaging, and super-resolution, and empirically determine the reconstruction quality as a function of training set size while simultaneously scaling the network size. For all three tasks we find that an initially steep power-law scaling slows significantly already at moderate training set sizes. Extrapolating these scaling laws suggests that even training on millions of images would not significantly improve performance. To understand this behavior, we analytically characterize the performance of a linear estimator learned with early-stopped gradient descent. The result formalizes the intuition that once the error induced by learning the signal model is small relative to the error floor, more training examples do not improve performance.

1. INTRODUCTION

Deep neural networks trained to map a noisy measurement of an image, or a noisy image itself, to a clean image give state-of-the-art (SOTA) performance for image reconstruction problems. Examples are image denoising (Burger et al., 2012; Zhang et al., 2017; Brooks et al., 2019; Liang et al., 2021), super-resolution (Dong et al., 2016; Ledig et al., 2017; Liang et al., 2021), compressive sensing for computed tomography (CT) (Jin et al., 2017), and accelerated magnetic resonance imaging (MRI) (Zbontar et al., 2018; Sriram et al., 2020; Muckley et al., 2021; Fabian & Soltanolkotabi, 2022). The performance of a neural network for imaging is determined by the network architecture and optimization, the size of the network, and the size and quality of the training set. Significant work has been invested in architecture development. For example, in the field of accelerated MRI, networks started out as convolutional neural networks (CNNs) (Wang et al., 2016), which are now often used as building blocks in unrolled variational networks (Hammernik et al., 2018; Sriram et al., 2020). Most recently, transformers have been adapted to image reconstruction (Lin & Heckel, 2022; Huang et al., 2022; Fabian & Soltanolkotabi, 2022). However, it is not clear how substantial the latest improvements through architecture design are compared to the potential improvements expected from scaling the training set and network size. In contrast to natural language processing (NLP) models and modern image classifiers that are trained on billions of examples, networks for image reconstruction are trained on only hundreds to thousands of example images. For example, the training set of SwinIR (Liang et al., 2021), the current SOTA for image denoising, contains only 10k images, and the popular benchmark dataset for accelerated MRI consists of only 35k images (Zbontar et al., 2018).
In this work, we study whether neural networks for image reconstruction only require moderate amounts of data to reach their peak performance, or whether major boosts are expected from increasing the training set size. To partially address this question, we focus on three problems: image denoising, reconstruction from few and noisy measurements (compressive sensing) in the context of accelerated MRI, and super-resolution. We pick Gaussian denoising for its practical importance and because it can serve as a building block for solving more general image reconstruction problems well (Venkatakrishnan et al., 2013). We pick MR reconstruction because it is an important instance of a compressive sensing problem, and many problems can be formulated as compressive sensing problems, for example super-resolution and in-painting. In addition, for MR reconstruction the question of how much data is needed is particularly important, since collecting medical data is expensive. For the three problems we identify scaling laws that describe the reconstruction quality as a function of the training set size, while simultaneously scaling network sizes. Such scaling laws have been established for NLP and classification tasks, as discussed below, but not for image reconstruction. The experiments are conducted with a U-Net (Ronneberger et al., 2015) and the SOTA SwinIR (Liang et al., 2021), a transformer architecture. We primarily consider the U-Net since it is widely used for image reconstruction and acts as a building block in SOTA models for image denoising (Brooks et al., 2019; Gurrola-Ramos et al., 2021; Zhang et al., 2021; 2022) and accelerated MRI (Zbontar et al., 2018; Sriram et al., 2020). We also present results for denoising with the SwinIR. The SwinIR outperforms the U-Net, but we find that its scaling with the number of training examples does not differ notably. Our contributions are as follows:

• Empirical scaling laws for denoising.
We train U-Nets with 0.1M to 46.5M parameters on training sets of 100 to 100k images from the ImageNet dataset (Russakovsky et al., 2015) for Gaussian denoising at a noise level corresponding to a Peak Signal-to-Noise Ratio (PSNR) of 20.17 dB. While performance continues to increase for the largest training set and network sizes we consider, the rate of improvement for training set sizes beyond a few thousand images slows to a level indicating that even training on millions of images would yield only a marginal benefit; see Fig. 1(a). We also train SOTA SwinIRs with 4M to 129M parameters on the same range of training set sizes; see Fig. 2(a). Although the SwinIR performs better than the U-Net, its scaling behavior is essentially equivalent: again, beyond a few thousand images, only marginal benefits are expected.
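To make the noise level concrete, the following sketch (our own illustration, not code from the experiments; the helper names are hypothetical) adds white Gaussian noise to an image scaled to [0, 1] so that the noisy input has a PSNR of roughly 20.17 dB relative to the clean image:

```python
import numpy as np

def psnr(clean, noisy, max_val=1.0):
    """Peak Signal-to-Noise Ratio in dB for images with values in [0, max_val]."""
    mse = np.mean((clean - noisy) ** 2)
    return 10 * np.log10(max_val ** 2 / mse)

def add_gaussian_noise(img, target_psnr_db, rng):
    """Add white Gaussian noise so the noisy image has approximately the
    target PSNR (exact in expectation, for images scaled to [0, 1])."""
    sigma = np.sqrt(10 ** (-target_psnr_db / 10))
    return img + sigma * rng.standard_normal(img.shape)

rng = np.random.default_rng(0)
clean = rng.uniform(size=(256, 256))   # stand-in for a clean training image
noisy = add_gaussian_noise(clean, 20.17, rng)
print(round(psnr(clean, noisy), 2))    # close to 20.17 dB
```

For a 256×256 image the empirical noise power concentrates tightly around σ², so the realized PSNR deviates from the target by only a few hundredths of a dB.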



Figure 1: Empirical scaling laws for CNN-based image reconstruction. Reconstruction performance of a U-Net for denoising (a) and accelerated MRI (c) as a function of the training set size N. In both experiments, an initial steep power law R = βN^α transitions to a relatively flat one already at moderate N. We therefore expect that training on millions of images would not significantly improve performance. To obtain the scaling curves in (a), (c), we optimize over the number of network parameters, as shown in (b), (d). Colors in the plots on the left and right correspond to the same training set size. Since increasing the number of parameters does not further boost performance at the large training set sizes corresponding to the flattened scaling law, the decay in the scaling coefficient is a robust finding.
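A power law of the form R = βN^α is typically fit by linear regression in log-log space, since log R = log β + α log N. A minimal sketch of such a fit (the values below are illustrative synthetic data, not the paper's measurements):

```python
import numpy as np

# Synthetic (training set size, reconstruction quality) pairs following
# R = beta * N**alpha; in practice these would be measured PSNR values.
N = np.array([100, 300, 1000, 3000, 10000], dtype=float)
beta_true, alpha_true = 25.0, 0.02
R = beta_true * N ** alpha_true

# A power law is a straight line in log-log coordinates, so a degree-1
# least-squares fit recovers the exponent (slope) and prefactor (intercept).
alpha, log_beta = np.polyfit(np.log(N), np.log(R), deg=1)
beta = np.exp(log_beta)
print(round(alpha, 3), round(beta, 2))  # recovers alpha ≈ 0.02, beta ≈ 25
```

Fitting separate power laws to the initial steep regime and the later flat regime (as in the figure) amounts to running this regression on the two ranges of N on either side of the transition.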

