SCALING LAWS FOR DEEP LEARNING BASED IMAGE RECONSTRUCTION

Abstract

Deep neural networks trained end-to-end to map a measurement of a (noisy) image to a clean image perform excellently for a variety of linear inverse problems. Current methods are trained on only a few hundred or a few thousand images, as opposed to the millions of examples deep networks are trained on in other domains. In this work, we study whether major performance gains are expected from scaling up the training set size. We consider image denoising, accelerated magnetic resonance imaging, and super-resolution and empirically determine the reconstruction quality as a function of training set size, while simultaneously scaling the network size. For all three tasks we find that an initially steep power-law scaling slows significantly already at moderate training set sizes. Extrapolating those scaling laws suggests that even training on millions of images would not significantly improve performance. To understand the expected behavior, we analytically characterize the performance of a linear estimator learned with early stopped gradient descent. The result formalizes the intuition that once the error induced by learning the signal model is small relative to the error floor, more training examples do not improve performance.

1. INTRODUCTION

Deep neural networks trained to map a noisy measurement of an image or a noisy image to a clean image give state-of-the-art (SOTA) performance for image reconstruction problems. Examples are image denoising (Burger et al., 2012; Zhang et al., 2017; Brooks et al., 2019; Liang et al., 2021) , super-resolution (Dong et al., 2016; Ledig et al., 2017; Liang et al., 2021) , and compressive sensing for computed tomography (CT) (Jin et al., 2017) , and accelerated magnetic resonance imaging (MRI) (Zbontar et al., 2018; Sriram et al., 2020; Muckley et al., 2021; Fabian & Soltanolkotabi, 2022) . The performance of a neural network for imaging is determined by the network architecture and optimization, the size of the network, and the size and quality of the training set. Significant work has been invested in architecture development. For example, in the field of accelerated MRI, networks started out as convolutional neural networks (CNN) (Wang et al., 2016) , which are now often used as building blocks in un-rolled variational networks (Hammernik et al., 2018; Sriram et al., 2020) . Most recently, transformers have been adapted to image reconstruction (Lin & Heckel, 2022; Huang et al., 2022; Fabian & Soltanolkotabi, 2022) . However, it is not clear how substantial the latest improvements through architecture design are compared to potential improvements expected by scaling the training set and network size. Contrary to natural language processing (NLP) models and modern image classifiers that are trained on billions of examples, networks for image reconstruction are only trained on hundreds to thousands of example images. For example, the training set of SwinIR (Liang et al., 2021) , the current SOTA for image denoising, contains only 10k images, and the popular benchmark dataset for accelerated MRI consists only of 35k images (Zbontar et al., 2018) . 
Figure 1 (caption): In both experiments an initial steep power law $R = \beta N^{\alpha}$ transitions to a relatively flat one already at moderate N. Thus we expect that training on millions of images does not significantly improve performance. To obtain the scaling curves in (a), (c) we optimize over the number of network parameters as shown in (b), (d). Colors in the plots on the left and right correspond to the same training set size. Since for large training set sizes corresponding to the flattened scaling law, increasing the parameters does not boost performance further, the decay in scaling coefficients is a robust finding.

In this work, we study whether neural networks for image reconstruction only require moderate amounts of data to reach their peak performance, or whether major boosts are expected from increasing the training set size. To partially address this question, we focus on three problems: image denoising, reconstruction from few and noisy measurements (compressive sensing) in the context of accelerated MRI, and super-resolution. We pick Gaussian denoising for its practical importance and since it can serve as a building block to solve more general image reconstruction problems well (Venkatakrishnan et al., 2013). We pick MR reconstruction because it is an important instance of a compressive sensing problem, and many problems can be formulated as compressive sensing problems, for example super-resolution and in-painting. In addition, for MR reconstruction the question of how much data is needed is particularly important, since it is expensive to collect medical data. For the three problems we identify scaling laws that describe the reconstruction quality as a function of the training set size, while simultaneously scaling network sizes. Such scaling laws have been established for NLP and classification tasks, as discussed below, but not for image reconstruction.
The experiments are conducted with a U-Net (Ronneberger et al., 2015) and the SOTA SwinIR (Liang et al., 2021), a transformer architecture. We primarily consider the U-Net since it is widely used for image reconstruction and acts as a building block in SOTA models for image denoising (Brooks et al., 2019; Gurrola-Ramos et al., 2021; Zhang et al., 2021; 2022) and accelerated MRI (Zbontar et al., 2018; Sriram et al., 2020). We also present results for denoising with the SwinIR. The SwinIR outperforms the U-Net, but we find that its scaling with the number of training examples does not differ notably. Our contributions are as follows:
• Empirical scaling laws for denoising. We train U-Nets of sizes 0.1M to 46.5M parameters with training set sizes from 100 to 100k images from the ImageNet dataset (Russakovsky et al., 2015) for Gaussian denoising at a noise level of 20.17dB peak signal-to-noise ratio (PSNR). While performance continues to increase for the largest training set sizes and network sizes we consider, the rate of performance increase for training set sizes beyond a few thousand images slows to a level indicating that even training on millions of images only yields a marginal benefit, see Fig. 1(a). We also train SOTA SwinIRs of sizes 4M to 129M parameters on the same range of training set sizes, see Fig. 2(a). Although the SwinIR performs better than the U-Net, its scaling behavior is essentially equivalent, and again after a few thousand images, only marginal benefits are expected. Scaling up its training set and network size did, however, give benefits: the largest model we trained yields new SOTA results on four common test sets by 0.05 to 0.22dB.
• Empirical scaling laws for compressive sensing. We train U-Nets of sizes from 2M to 500M parameters for 4x accelerated MRI with training set sizes from 50 to 50k images from the fastMRI dataset (Knoll et al., 2020b). We again find that beyond a dataset size of about 2.5k, the rate of improvement as a function of the training set size slows considerably, see Fig. 1(c). This indicates that while models are expected to improve by increasing the training set size, we expect that training on millions of images does not significantly improve performance.
• Empirical scaling laws for super-resolution. We also briefly study super-resolution, and, as for denoising and compressive sensing, find a slowing of the scaling law at moderate dataset sizes, see Appx. C.
• Understanding scaling laws for denoising theoretically. Our empirical results indicate that the denoising performance of a neural network trained end-to-end doesn't increase as a function of training examples beyond a certain point. This is expected since once we reach a noise-specific error floor, more data is not beneficial. We make this intuition precise for a linear estimator learned with early stopped gradient descent to denoise data drawn from a d-dimensional linear subspace. We show that the reconstruction error is upper bounded by d/N plus a noise-dependent error floor level.

2. RELATED WORK

Scaling laws for prediction problems. Under the umbrella of statistical learning theory convergence rates of 1/N or 1/ √ N have been established for a range of relatively simple models and distributions, see e.g. Wainwright (2019) for an overview. For deep neural networks used in practice, a recent line of work has empirically characterized the performance as a function of training set size and/or network size for classification and NLP (Hestness et al., 2017; Rosenfeld et al., 2019; Kaplan et al., 2020; Bahri et al., 2021; Zhai et al., 2021; Ghorbani et al., 2021; Bansal et al., 2022) . In those domains, the scaling laws persist even for very large datasets, as described in more detail below. In contrast, for image reconstruction, we find that the power-law behavior already slows considerably at relatively small numbers of training examples. Rosenfeld et al. (2019) find power-law scaling of performance with training set and network size across models and datasets for language modeling and classification. However, because the work fixes either training set or network size, while scaling the other, the scaling laws span only moderate ranges before saturating at a level determined by the fixed quantity and not the problem specific error floor. The papers Kaplan et al. (2020) ; Bahri et al. (2021) ; Zhai et al. (2021) ; Ghorbani et al. (2021) ; Bansal et al. (2022) including ours scale the dataset size and model size simultaneously, resulting in an improved predictive power of the obtained scaling laws. Hestness et al. (2017) study models for language and image classification, and attribute deviations from a power-law curve to a lack of fine-tuning the hyperparameters of very large networks. For transformer language models Kaplan et al. (2020) find no deviation from a power-law for up to a training set and network size of 1B images and parameters. Further, Zhai et al. 
(2021) find the performance for Vision Transformers (Dosovitskiy et al., 2020) for few-shot image classification to deviate from a power-law curve only at extreme model sizes of 0.3-1B parameters and 3B images. The number of available training images for accelerated MRI is limited to few publicly available datasets. Hence, current SOTA (Sriram et al., 2020; Fabian & Soltanolkotabi, 2022) are trained on the largest datasets available (the fastMRI dataset), consisting of 35k images for knee and 70k for brain reconstruction (Zbontar et al., 2018; Muckley et al., 2021) . Adaptations of the MLP-Mixer (Tolstikhin et al., 2021) and the Vision Transformer (Dosovitskiy et al., 2020) to MRI (Mansour et al., 2022; Lin & Heckel, 2022) ablate their performance as a function of the training set size and find that non-CNN based approaches require more data to achieve a similar performance, but potentially also benefit more from even larger datasets. However, those experiments are run with fixed model sizes and only 3-4 realizations of training set size thus it is unclear when performance saturates.

3. EMPIRICAL SCALING LAWS FOR DENOISING

In this section, we consider the problem of estimating an image x from a noisy observation y = x + z, where z is additive Gaussian noise $z \sim \mathcal{N}(0, \sigma_z^2 I)$. Neural networks trained end-to-end to map a noisy observation to a clean image perform best, and outperform classical approaches like BM3D (Dabov et al., 2007) and non-local means (Buades et al., 2005) that are not based on training data.

Datasets. To enable studying learned denoisers over a wide range of training set sizes we work with the ImageNet dataset (Russakovsky et al., 2015). Its training set contains 1.3M images of 1000 different classes. We reserve 20 random classes for validation and testing. We design 10 training set subsets $S_N$ of sizes N ∈ [100, 100000] (see Fig. 1(a)) with $S_i \subseteq S_j$ for i ≤ j. To make the distributions of images between the subsets as homogeneous as possible, the numbers of images from different classes within a subset differ by at most one. We generate noisy images by adding Gaussian noise with standard deviation $\sigma_z = 25$ (pixels in the range from 0 to 255), i.e. 20.17dB in PSNR.

Model variants and training. We train a CNN-based U-Net; a detailed description of the model architecture is in Appx. A.1. We also train the Swin-Transformer (Liu et al., 2021) based SwinIR model (Liang et al., 2021) that achieves SOTA for a variety of image reconstruction tasks, including denoising. This model is interesting since transformers scale well with the number of training examples for other domains, e.g., for image classification pre-training (Dosovitskiy et al., 2020).

A training set size of around 100 is already sufficient to train a decent image denoiser. Beyond 100, we can fit a linear power law to the performance of the U-Net with a scaling coefficient α = 0.0048 that approximately holds up to training set sizes of about 6k images. Beyond 6k, we can fit a second linear power law with a significantly smaller scaling coefficient α = 0.0019.
Similarly, for the SwinIR we can fit a power law with coefficient α = 0.0050 in the regime of little training data and a power law with significantly smaller coefficient α = 0.0017 in the regime of moderate training data, thus the two architectures scale essentially equivalently. While denoising benefits from more training examples, the drop in scaling coefficient indicates that scaling from thousands to millions of images is only expected to yield relatively marginal performance gains. While the gains are small and we expect gains from further scaling to be even smaller, our largest SwinIR trained on 100k images yields a new SOTA for Gaussian color image denoising. On four common test sets it achieves an improvement between 0.05 to 0.22dB, see Appx. B. Visualizations of reconstructions from models along the curves in Fig. 1 (a) and Fig. 2 (a) can be found in Fig. 4 , Appx. A. Comparing the U-Net scaling to SwinIR scaling (Fig. 2(a) ) we see that the effect of large training sets on the performance of the U-Net can not make up for the improved modeling error that stems from the superior network architecture of the SwinIR. In fact, the simulated scaling coefficient α = 0.0019 predicts a required training set size of about 150M images for U-Net to achieve the best performance of the SwinIR, and that is an optimistic prediction since it assumes that the scaling does not slow down further for another 3 orders of magnitude. Thus, model improvements such as those obtained by transformers over convolutional networks are larger than what we expect from scaling the number of training examples from tens of thousands to millions. An interesting question is whether different noise levels lead to different data requirements, as indicated by a different scaling behavior. We investigate the performance of U-Net for a smaller noise variance σ z = 15 in Appx. D.2. 
While the results are still preliminary, the qualitative scaling behavior is similar; the differences are that the scaling law pertaining to smaller noise is slightly steeper, and overall the performance improves by about 1.3dB in PSNR.

Finally, note that for our results we choose to re-sample the noise per training image in every training epoch, since it makes the best use of the available clean images. However, fixing the noise is also interesting, since it is more similar to a setup in which the noise statistics are unknown, as in real-world noise removal. See Appx. D.1 for additional results where the noise is fixed. We found that compared to re-sampling the noise the performance of a U-Net drops by 0.3dB for small training set sizes. As the training set size increases, the performance approaches that of re-sampling the noise, resulting in a steeper scaling law that flattens at slightly larger training set sizes.

Robustness of our finding of slowing scaling laws. Our main finding for denoising is that the scaling laws slow at relatively small numbers of training data points (i.e., the scaling coefficient becomes very small). We next argue why we think that this is a robust finding. In principle it could be that the networks are not sufficiently large for the performance to improve further as the dataset size increases. However, for the U-Net and training set sizes of 10k and 30k the performance increase already slows, and for those training set sizes, increasing the network size does not improve performance, as shown in Fig. 1(b). For smaller network sizes the curves in Fig. 1(b) first increase and then decrease, indicating that we found network sizes close to the optimum. The parameter scaling of the SwinIR in Fig. 2(b) indicates that for training set sizes of 10k and 100k the performance still benefits from larger networks. Hence, the power law for those training set sizes in Fig. 2(a) can become slightly steeper once we further increase the network size.
However, for the slope of the flat power law in (a) to reach the slope of the steep power law, the parameter scaling in (b) for 100k training images would need to hold for another 2 orders of magnitude, i.e. about 12B instead of the currently used 129M network parameters, which is very unlikely considering that the current network sizes already suffice to saturate the performance for small/moderate training set sizes. Finally, it could be that with higher quality or higher diversity of the data, the denoising performance would further improve. In our experiments, we increase the training set sizes by adding more images from the same classes of ImageNet, and we continue to test on different classes. Thus, it could be that adding training examples from a different, more diverse data source leads to less of a slowing of the scaling law in Fig. 1(a) . We test this hypothesis by scaling our training set with 3k images to 10k images not by adding more images from ImageNet but images from the datasets that are used to train the original SwinIR, i.e., we add all images from DIV2K (Agustsson & Timofte, 2017) , Flickr2K (Timofte et al., 2017) and BSD500 (Arbeláez et al., 2011) and add the remaining images from WED (Ma et al., 2017) . Keeping all other hyperparameters fixed the U-Net obtains for this dataset a PSNR of 32.239dB, which is slightly worse than the 32.252dB it achieves with our training set with 10k images only from ImageNet. Hence, we reject the hypothesis that the drop in scaling coefficients can be explained with how the training sets are designed in our experiment.
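The two-regime power-law fit discussed in this section can be sketched as follows. This is a minimal illustration under stated assumptions, not the fitting code behind the figures: the PSNR values are synthetic, generated from the two reported coefficients (α = 0.0048 and α = 0.0019) with a hypothetical PSNR of 32 dB at the 6k-image transition, and all function and variable names are illustrative.

```python
import numpy as np


def fit_power_law(n, r):
    """Least-squares fit of r = beta * n**alpha, done as a line fit in log-log space."""
    alpha, log_beta = np.polyfit(np.log(n), np.log(r), deg=1)
    return alpha, np.exp(log_beta)


# Training set sizes listed in Appx. A.1 (number of images).
sizes = np.array([100, 300, 600, 1_000, 3_000, 6_000,
                  10_000, 30_000, 60_000, 100_000], dtype=float)

# Scaling coefficients reported in the text for the U-Net.
ALPHA_STEEP, ALPHA_FLAT = 0.0048, 0.0019
KNEE = 6_000          # transition point reported in the text
PSNR_AT_KNEE = 32.0   # hypothetical value, chosen only for illustration


def synthetic_psnr(n):
    """Synthetic (not measured) PSNR following a two-regime power law."""
    alpha = np.where(n <= KNEE, ALPHA_STEEP, ALPHA_FLAT)
    return PSNR_AT_KNEE * (n / KNEE) ** alpha


psnr = synthetic_psnr(sizes)

# Fit each regime separately, as done for the two linear power laws.
steep, flat = sizes <= KNEE, sizes >= KNEE
alpha_steep, _ = fit_power_law(sizes[steep], psnr[steep])
alpha_flat, beta_flat = fit_power_law(sizes[flat], psnr[flat])

# Extrapolate the flat law from 100k to 1M training images.
gain_db = beta_flat * (1e6 ** alpha_flat - 1e5 ** alpha_flat)
print(f"alpha_steep={alpha_steep:.4f}, alpha_flat={alpha_flat:.4f}, "
      f"extrapolated gain={gain_db:.3f} dB")
```

Under these assumptions the fit recovers both coefficients, and extrapolating the flat law from 100k to 1M images yields a gain of roughly 0.14 dB, which illustrates why a tenfold increase in training data is expected to help only marginally.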

4. EMPIRICAL SCALING LAWS FOR COMPRESSIVE SENSING

We consider compressive sensing (CS) to achieve 4x accelerated MRI. MRI is an important medical imaging technique for its non-invasiveness and high accuracy. Due to the physics of the measurement process, MRI is inherently slow. Accelerated MRI aims at significantly reducing the scan time by undersampling the measurements. In addition, MRI scanners rely on parallel imaging, which uses multiple receiver coils to simultaneously collect different measurements of the same object. This leads to the following reconstruction problem. We are given measurements $y_i \in \mathbb{C}^m$ of the image $x \in \mathbb{C}^n$ as $y_i = M F S_i x + \mathrm{noise}_i$ for i = 1, ..., C, and our goal is to reconstruct the image from those measurements. Here, C is the number of receiver coils, the diagonal matrices $S_i \in \mathbb{C}^{n \times n}$ are the sensitivity maps modelling the signal strength perceived by the i-th coil, $F \in \mathbb{C}^{n \times n}$ is the discrete Fourier transform (DFT), and $M \in \mathbb{C}^{m \times n}$ is a binary mask implementing the undersampling. Classical CS approaches (Lustig et al., 2008) first estimate the sensitivity maps with a method such as ESPIRiT (Uecker et al., 2014), and second estimate the unknown image by solving a regularized optimization problem, such as total-variation norm minimization. Recently, deep learning based methods have been shown to significantly outperform classical CS methods due to their ability to learn more complex and accurate signal models from data (Knoll et al., 2020a; Muckley et al., 2021).

Datasets. To explore the performance of learning based MR reconstruction over a wide range of training set sizes, we use the fastMRI multi-coil brain dataset (Knoll et al., 2020b), which is the largest publicly available dataset for MRI. The dataset consists of images of different contrasts. To ensure that the statistics of the dataset are as homogeneous as possible, we take the subset of the dataset corresponding to a single contrast, resulting in 50k training images.
We design training set subsets $S_N$ of size N ∈ [50, 50000] (see Fig. 1(c)) with $S_i \subseteq S_j$ for i ≤ j. For more information on selection, division and subsampling of the dataset see Appx. A.2.

Model variants and training. We train the same U-Net model used in Section 3 and described in Appx. A.1 but with 4 blocks per encoder/decoder. The network is trained end-to-end to map a coarse reconstruction $x_{\mathrm{CR}}$ to the ground truth image x. The coarse reconstruction is obtained as $x_{\mathrm{CR}} = \left( \sum_{i=1}^{C} |F^{-1} y_{i,\mathrm{ZF}}|^2 \right)^{1/2}$, where $F^{-1}$ is the inverse DFT, and $y_{i,\mathrm{ZF}}$ are the undersampled measurements of the i-th coil, where missing entries are filled with zeros. We vary the number of network parameters as depicted in Fig. 1(d). Appx. A.2 contains detailed descriptions of how the model is trained and the network size is adapted.

Results and discussion. For each training set size Fig. 1(c) shows the reconstruction performance in structural similarity (SSIM) of the best model over all simulated network sizes. There are two main findings. First, we can fit a linear power law with a scaling coefficient α = 0.0072 that holds up to training set sizes of about 2.5k images. Second, for training set sizes starting from 5k we can fit a second linear power law but with a significantly smaller scaling coefficient α = 0.0026. Similar to denoising in Section 3 we conclude that while accelerated MRI slightly benefits from more training examples, the drop in scaling coefficient indicates that it is unlikely that even training set sizes of the order of hundreds of millions of images result in substantial gains. Visualizations of reconstructions from models along the curve in Fig. 1(c) can be found in Fig. 5, Appx. A.2.

Robustness of our finding of slowing scaling laws. Next, we argue why the drop in scaling coefficient for accelerated MRI is a robust finding. In Fig. 1(d) we demonstrate that the network size does not bottleneck the performance of our largest training sets.
For each training set size the performance as a function of the number of network parameters is relatively flat. Even small training sets benefit from large networks before their performance slightly decays for very large networks. Hence, we expect that for large training sets the performance as a function of network parameters would not increase significantly before decaying. Finally, is there a different type of training data that would improve the scaling coefficient? We don't think so, since all experiments are for the very specific task of reconstructing brain images of one particular contrast, which means that the examples that we add to increase the size of the training sets are already the data most suitable to solve this task.
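The coarse reconstruction used as network input in this section (zero-filling the missing k-space entries, applying the inverse DFT per coil, and combining the coils by root-sum-of-squares) can be sketched as below. This is a simplified illustration, not the authors' pipeline: the function name, the array layout (coils, height, width), the equispaced column mask, and the omission of FFT centering shifts are all assumptions.

```python
import numpy as np


def coarse_reconstruction(kspace, mask):
    """Root-sum-of-squares image from zero-filled multi-coil k-space.

    kspace: complex array of shape (C, H, W), one k-space per receiver coil.
    mask:   binary array broadcastable to (H, W); 1 = measured, 0 = missing.
    """
    zero_filled = kspace * mask  # missing entries are filled with zeros
    coil_images = np.fft.ifft2(zero_filled, axes=(-2, -1), norm="ortho")
    # Combine the coils: square root of the sum of squared magnitudes.
    return np.sqrt(np.sum(np.abs(coil_images) ** 2, axis=0))


# Toy example with random data standing in for coil images (hypothetical sizes).
rng = np.random.default_rng(0)
coil_imgs = rng.standard_normal((4, 8, 8)) + 1j * rng.standard_normal((4, 8, 8))
kspace = np.fft.fft2(coil_imgs, axes=(-2, -1), norm="ortho")

# Hypothetical 4x undersampling: keep every fourth k-space column.
mask_4x = np.zeros((1, 8))
mask_4x[:, ::4] = 1.0

x_full = coarse_reconstruction(kspace, np.ones((8, 8)))
x_cr = coarse_reconstruction(kspace, mask_4x)
```

With a mask of all ones the function returns the root-sum-of-squares of the original coil images, which is a convenient sanity check; with the undersampling mask it returns the aliased coarse image that a network would then be trained to map to the ground truth.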

5. UNDERSTANDING SCALING LAWS FOR DENOISING THEORETICALLY

We study a simple linear denoiser that is trained end-to-end to reconstruct a clean image from a noisy observation, in order to understand what the error as a function of the number of training examples is expected to look like for inverse problems. We define a joint distribution over a signal and the corresponding measurement (x, y) as follows. Consider a d < n dimensional subspace of $\mathbb{R}^n$ parameterized by the orthonormal basis $U \in \mathbb{R}^{n \times d}$. We draw a signal approximately uniformly from the subspace, x = Uc, where $c \sim \mathcal{N}(0, I)$, and an associated noisy measurement as y = x + z, where $z \sim \mathcal{N}(0, \sigma_z^2 I)$ is Gaussian noise. The subspace is unknown, but we are given a training set $\{(x_1, y_1), \ldots, (x_N, y_N)\}$, consisting of examples drawn iid from the joint distribution over (x, y).

We assume that the signal lies in a low-dimensional subspace. Assuming that data lies in a low-dimensional subspace or, more generally, a union of low-dimensional subspaces, is common; for example, it underlies the denoising of natural images via wavelet thresholding (Donoho & Johnstone, 1995; Simoncelli & Adelson, 1996; Chang et al., 2000). Mohan et al. (2020) found that even deep learning based denoisers implicitly perform a projection onto an adaptively-selected low-dimensional subspace that captures the features of natural images.

We consider a linear estimator of the form $f_W(y) = Wy$, and measure performance in terms of the expected mean-squared reconstruction error, defined as $R(W) = \frac{1}{d} \mathbb{E}\left[ \|Wy - x\|_2^2 \right]$, where the expectation is over the joint distribution of (x, y).

The optimal linear estimator. The optimal linear estimator (i.e., the estimator that minimizes the risk R) is given by $W^* = \frac{1}{1 + \sigma_z^2} U U^T$. The estimator projects the data onto the subspace and shrinks towards zero, depending on the noise variance. The associated risk is $R(W^*) = \sigma_z^2 / (1 + \sigma_z^2)$.

An estimator based on subspace estimation. The optimal estimator $W^*$ requires knowledge of the unknown subspace.
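As a numerical sanity check of the data model and the optimal linear estimator defined above, the following sketch draws samples from the subspace model and compares a Monte Carlo estimate of the risk of W* with the closed form σ_z²/(1 + σ_z²). The dimensions, noise level, sample count, and helper names are hypothetical choices made only for illustration.

```python
import numpy as np


def sample_pairs(U, sigma_z, num, rng):
    """Draw `num` pairs (x, y) with x = U c, c ~ N(0, I), and y = x + z."""
    n, d = U.shape
    x = U @ rng.standard_normal((d, num))
    y = x + sigma_z * rng.standard_normal((n, num))
    return x, y


def empirical_risk(W, x, y, d):
    """Monte Carlo estimate of R(W) = (1/d) E ||W y - x||_2^2."""
    return np.mean(np.sum((W @ y - x) ** 2, axis=0)) / d


rng = np.random.default_rng(0)
n, d, sigma_z = 50, 5, 0.5  # hypothetical problem sizes and noise level

# Random orthonormal basis of a d-dimensional subspace of R^n.
U, _ = np.linalg.qr(rng.standard_normal((n, d)))

W_star = U @ U.T / (1.0 + sigma_z ** 2)  # project, then shrink
W_proj = U @ U.T                          # project only, no shrinkage

x, y = sample_pairs(U, sigma_z, 20_000, rng)
risk_star = empirical_risk(W_star, x, y, d)
risk_proj = empirical_risk(W_proj, x, y, d)
risk_formula = sigma_z ** 2 / (1.0 + sigma_z ** 2)

print(f"W*: {risk_star:.3f} (closed form {risk_formula:.3f}), "
      f"projection only: {risk_proj:.3f}")
```

The projection-only estimator is included to show the effect of the shrinkage factor: its risk is σ_z², which is always larger than the risk of W*.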
We can learn the subspace from the noisy data by performing principal component analysis on $Y = [y_1, \ldots, y_N]$. Specifically, we estimate the subspace as the d leading singular vectors of the empirical covariance matrix $YY^T \in \mathbb{R}^{n \times n}$, denoted by $\hat{U} \in \mathbb{R}^{n \times d}$. If the measurement noise is zero (i.e., $\sigma_z^2 = 0$), $\hat{U}$ is an orthonormal basis for the d-dimensional subspace U provided we observe at least d linearly independent signals, which occurs with probability one if we draw data according to the model defined above and if the number of training examples obeys N ≥ d. There is a vast literature on PCA; see Vershynin (2011; 2018); Rudelson & Vershynin (2010); Tropp (2012) for tools from high-dimensional probability to analyze the PCA estimate.

Published as a conference paper at ICLR 2023

Theorem 1. Suppose that the number of training examples obeys $(d + n\sigma_z^2) \log(n) \le N$. For a numerical constant c, with probability at least $1 - n^{-10} - 3e^{-d} + e^{-n}$, the risk of the PCA-estimate is bounded by $R(W_{\mathrm{PCA}}) \le R(W^*) + c (d + n\sigma_z^2) \log(n) / N$.

Thus, as long as the number of training examples N is sufficiently large relative to $(d + n\sigma_z^2)$, the risk of the PCA-estimator is close to the risk of the optimal estimator.

Estimator learned end-to-end. We now consider an estimator learned end-to-end, by applying gradient descent to the empirical risk $L(W) = \sum_{i=1}^{N} \|W y_i - x_i\|_2^2$, and we regularize via early-stopping the gradient descent iterations. The risk of the estimate after k iterations of gradient descent, $W_k$, is bounded by the next result. This estimator mimics the supervised training we consider throughout this paper, with the difference that here we consider a simple linear estimator, whereas in the previous sections we trained a neural network.

Theorem 2. Let $W_k$ be the matrix obtained by applying k iterations of gradient descent with stepsize η starting at $W_0 = 0$ to the loss L(W).
Consider the regime where the number of training examples obeys (d + nσ 2 z ) log(n) ≤ N ≤ ξd/σ 2 z for an arbitrary ξ and N log(N ) ≤ n. For an appropriate choice of the stepsize η there exists an optimal early stopping time k opt at which the risk of the estimator f W (y) = W kopt y is upper-bounded with probability at least 1 -2e -N/8 -2e -N/18 -5n -9 -5e -d -2e -n -e -N -2e -n/2 by R(W kopt ) ≤ (8 + 2ξ)R(W * ) + c 1 + nσ 2 z d log n (d + nσ 2 z ) log(n) N + cξ (d + σ 2 z n) log n N , where c is a numerical constant. The proof is provided in Appx. G. Similar to Thm. 1 the first term in the bound corresponds to the noise-dependent error floor and the second two terms decrease in (d + nσ 2 z )/N and represent the error induced by learning the estimator with a finite number of training examples. Once this error is small relative to the error floor, more training examples do not improve performance. See Appx. E for a more detailed discussion on the estimator and the assumptions of Thm. 2. Discussion. In Fig. 3 , we plot the risk as a function of the training set size for the early-stopped empirical risk minimization (ERM) and the PCA based estimator. Starting at a training set size of N = 100, we see that the risk minus the optimal risk follows approximately a power law, i.e., log(R(W) -R(W * )) ≈ -α log(N ), as suggested by Thms. 1 and 2. In practice, however, we don't know the risk of the optimal estimator, R(W * ). Therefore, in the second row of Fig. 3 and throughout our empirical results we plot the risk log(R(W)) as a function of the number of training examples log(N ) and distinguish different regions by fitting approximated scaling coefficients α to log(R(W)) ≈ -α log(N ). In our theoretical example, we can identify three regions, coined in Hestness et al. 
(2017): the small data region, in which the training set size does not suffice to learn a well-performing mapping; the power-law region, in which the performance decays approximately as $N^{\alpha}$ with α < 0; and a problem-specific irreducible error region, which is $R(W^*)$ here. Note that the power-law coefficients in the risk plot R(W) are smaller than those when plotting $R(W) - R(W^*)$, and are only approximate scaling coefficients. Already in this highly simplified setup of an inverse problem the true scaling coefficients heavily depend on model parameters such as the noise variance, as can be seen in Fig. 3. The scaling coefficients further vary with the signal dimension d and ambient dimension n (see Appx. E.1).

Limitations. This finding is based on studying three different reconstruction problems (denoising, super-resolution, and compressive sensing) and two architectures (U-Net and SwinIR), and for each setup we extensively optimized hyperparameters and carefully scaled the networks. Scaling gave new state-of-the-art results on four denoising datasets, which provides some confidence in the setup. Moreover, our statements necessarily pertain to the architectures and metrics we study. It is possible that other architectures, or another way of scaling the architectures, yield larger improvements. It is also widely acknowledged that image quality metrics (such as SSIM and PSNR) do not fully capture the image quality perceived by humans, and it could be that more training data yields improvements in image quality that are not captured well by SSIM and PSNR. Most importantly, our findings pertain to standard in-distribution evaluation, i.e., the test images are from the same distribution as the training images. Regarding robustness against testing on images out of the training distribution, recent work indicates a positive effect of larger and more diverse training sets (Miller et al., 2021; Darestani et al., 2022; Nguyen et al., 2022).
Thus it is possible that the out-of-distribution performance of image reconstruction methods improves when trained on a larger and more diverse dataset, even though the in-distribution performance does not improve further.

Future research. We focus on supervised image reconstruction methods. For image denoising and other image reconstruction problems, self-supervised approaches also perform very well (Laine et al., 2019; Wang et al., 2022; Zhou et al., 2022), even though supervised methods perform best if clean images are available. It would be very interesting to investigate the scaling laws for self-supervised methods; perhaps more training examples are required to achieve the performance of a supervised setup. We leave this for future work.

A DETAILS OF THE EXPERIMENTAL SETUPS

A.1 EXPERIMENTAL DETAILS FOR EMPIRICAL SCALING LAWS FOR DENOISING WITH A U-NET

In this section, we give a detailed description of the experimental setup that led to our results for Gaussian denoising with a U-Net presented in Fig. 1(a),(b) and Section 3. In addition, Fig. 4 shows examples of reconstructions from different models along the performance curve in Fig. 1(a). In these examples the improvement in perceived image quality from increasing the training set size from 100 to 1000 is larger than from increasing from 1000 to 10000 or from 10000 to 100000. This correlates with our quantitative findings in Fig. 1(a). Next, we describe the experimental details.

We train U-Nets with two blocks each in the encoder and the decoder, and skip connections between blocks. Each block consists of two convolutional layers with LeakyReLU activation and instance normalization (Ulyanov et al., 2017) after every layer, where the number of channels is doubled (halved) after every block in the encoder (decoder). The downsampling in the encoder is implemented as average pooling and the upsampling in the decoder as transposed convolutions. As proposed by Zhang et al. (2017) we train a residual denoiser that learns to predict y - x instead of directly predicting x, which improves performance. We scale the network size by increasing the width of the network, i.e., by scaling the number of channels per (transposed) convolutional layer. For denoising we trained U-Nets of 7 different sizes. We vary the number of channels in the first layer in {16, 32, 64, 128, 192, 256, 320}, which corresponds to {0.1, 0.5, 1.9, 7.4, 16.7, 30.0, 46.5} million parameters. We do not vary the depth, since this would change the dimension of the informational bottleneck in the U-Net and thus change the model family itself.

The exact training set sizes we consider are {0.1, 0.3, 0.6, 1, 3, 6, 10, 30, 60, 100} thousand images from ImageNet. We center crop the training images to 256 × 256 pixels.
We also tried a smaller patch size of 128 × 128 pixels, but larger patches performed better in the regime of large training set sizes, which is why the results presented in the main body are for patch size 256 × 256. For a comparison between the scaling behavior of the two different patch sizes see Fig. 9, Appx. D.3. We do not use any data augmentation, since it is unclear how to account for it in the number of training examples. For validation and testing we use 80 and 300 images, respectively, taken from 20 random classes that are not used for training.

We use the mean-squared-error loss and the Adam optimizer (Kingma & Ba, 2014) with $\beta_1 = 0.9$, $\beta_2 = 0.999$. For moderate training set sizes up to 3000 images we find that heuristically adjusting the initial learning rate with an automated learning rate annealing performs well. To this end, we start with a learning rate of $10^{-4}$ and increase it after every epoch by a factor of 2 until the validation loss does not improve for 3 consecutive epochs. We then load the model checkpoint from the learning rate that was still performing well and continue with that learning rate. We observe that this scheme typically picks an initial learning rate that is reduced by a factor of 2 for every increase in the number of channels C. For training set sizes larger than 3000 we directly apply this rule to pick an initial learning rate without annealing, as we found this to give slightly better results. In particular, for the numbers of channels C ∈ {128, 192, 256, 320} we used the initial learning rates η ∈ {0.0032, 0.0016, 0.0008, 0.0004}, respectively.

For small training sets up to 1000 images we found that a batch size of 1 works best. For larger training sets we use a batch size of 10 and found that further increasing the batch size does not improve performance. We do not put a limit on the amount of compute for training.
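The learning rate heuristic above can be sketched as follows. This is a minimal illustration under stated assumptions: `val_loss_for_epoch` is a hypothetical callback that trains for one epoch at the given learning rate and returns the validation loss, and returning the learning rate of the best epoch is our reading of "the learning rate that was still performing well". Applying the halving rule to widths below C = 128 is an extrapolation that the text does not report.

```python
CHANNELS = [16, 32, 64, 128, 192, 256, 320]


def anneal_initial_lr(val_loss_for_epoch, start_lr=1e-4, factor=2.0,
                      patience=3, max_epochs=50):
    """Double the learning rate after every epoch until the validation loss
    has not improved for `patience` consecutive epochs; return the learning
    rate of the best epoch (assumption, see lead-in)."""
    lr = start_lr
    best_loss, best_lr = float("inf"), start_lr
    bad_epochs = 0
    for epoch in range(max_epochs):
        loss = val_loss_for_epoch(epoch, lr)  # hypothetical training callback
        if loss < best_loss:
            best_loss, best_lr, bad_epochs = loss, lr, 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:
                break
        lr *= factor
    return best_lr


def initial_lr_from_rule(c, anchor_c=128, anchor_lr=0.0032):
    """Halve the initial learning rate for every step up in the width list.
    The anchor (C=128, lr=0.0032) is taken from the reported values."""
    steps = CHANNELS.index(c) - CHANNELS.index(anchor_c)
    return anchor_lr * 0.5 ** steps


if __name__ == "__main__":
    for c in (128, 192, 256, 320):
        print(c, initial_lr_from_rule(c))
```

In a real training loop the callback would also save a checkpoint per epoch so that the model belonging to the returned learning rate can be reloaded.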
We use an automated learning rate decay that reduces the learning rate by a factor of 0.5 if the validation PSNR has not improved by at least 0.001 for 10 epochs (6 epochs for training set sizes of 6000 images or more). Once the learning rate drops to $10^{-5}$ we observe almost no gains in validation loss and stop the training after 10 additional epochs. For training set sizes up to 1000 we train 3 random seeds and pick the best run. As the variance between runs, relative to the gain in performance between different training set sizes, decreases with increasing training set size, we only run one seed for larger training set sizes. The experiments were conducted on four NVIDIA A40, four NVIDIA RTX A6000 and four NVIDIA Quadro RTX 6000 GPUs. We measure the time in GPU hours until the best epoch according to the validation loss, resulting in about 1800 GPU hours for the experiments in Fig. 1(a),(b).
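The decay and stopping rule described above can be written as a small state machine. The class below is a sketch and not the training code used for the experiments. The class name, its argument names and the small tolerance in the comparison with the final learning rate are choices made here for illustration.

```python
class PlateauDecay:
    """Halve the learning rate when the validation PSNR stalls, and stop a
    fixed number of epochs after the final learning rate is reached."""

    def __init__(self, lr, factor=0.5, min_delta=1e-3, patience=10,
                 final_lr=1e-5, extra_epochs=10):
        self.lr = lr
        self.factor = factor
        self.min_delta = min_delta
        self.patience = patience
        self.final_lr = final_lr
        self.extra_epochs = extra_epochs
        self.best = float("-inf")
        self.stale = 0
        self.epochs_at_final = 0

    def step(self, val_psnr):
        """Record one epoch's validation PSNR; return True to stop training."""
        # Tolerance guards against floating-point round-off in the comparison.
        if self.lr <= self.final_lr * (1.0 + 1e-9):
            self.epochs_at_final += 1
            return self.epochs_at_final >= self.extra_epochs
        if val_psnr >= self.best + self.min_delta:
            self.best = val_psnr
            self.stale = 0
        else:
            self.stale += 1
            if self.stale >= self.patience:
                self.lr *= self.factor
                self.stale = 0
        return False
```

A training loop would call `step` once per epoch with the current validation PSNR and read `lr` afterwards; `patience=6` corresponds to the setting for training sets of 6000 images or more.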

A.2 EXPERIMENTAL DETAILS FOR EMPIRICAL SCALING LAWS FOR COMPRESSIVE SENSING WITH A U-NET

In this Section, we give a detailed description of the experimental setup that led to our results for compressive sensing MRI presented in Fig. 1(c),(d) and Section 4. In addition, Fig. 5 shows examples of reconstructions from different models along the performance curve in Fig. 1(c). In these examples the improvement in perceived image quality from increasing the training set size from 500 to 2500 is larger than from increasing it from 2500 to 10000 or from 10000 to 50000. This agrees with our quantitative findings in Fig. 1(c). The first row in Fig. 5 shows an example in which all models, including the one trained on the largest training set, fail to recover a fine detail. This is a known problem in accelerated MRI and has been documented in both editions of the fastMRI challenge (Knoll et al., 2020a; Muckley et al., 2021), in which, for all methods, examples could be found where fine details were not recovered. However, the question remains whether more training data or better models would help, or whether the details simply cannot be recovered because the information is lost due to the large undersampling factors considered in the fastMRI challenges and in this work.

Next, we describe the experimental details. For compressed sensing in the context of accelerated MRI we trained U-Nets of 14 different sizes. We vary the number of channels in the first layer in {16, 32, 48, 64, 96, 112, 128, 144, 160, 176, 192, 208, 224, 256}, which corresponds to {2, 8, 18, 31, 70, 95, 124, 157, 193, 234, 279, 327, 380, 496} million network parameters. The exact training set sizes we consider are {0.05, 0.25, 0.5, 1, 2.5, 5, 10, 25, 50} thousand AXT2 weighted images from the fastMRI multi-coil brain dataset (Zbontar et al., 2018), where AXT2 corresponds to all images of one type of contrast. We focused on images of this type only, to make the statistics of our datasets as homogeneous as possible.
We do not use any data augmentation, since it is unclear how to account for it in the number of training examples. We use 4732 and 730 additional images for testing and validation. We consider an acceleration factor of 4, meaning that we only measure 25% of the information. We obtain the 4 times undersampled measurements by masking the fully sampled measurement $y$ with an equispaced mask with 8% center fraction, meaning that in the center of $y$ we take all the measurements and take the remaining ones at equispaced intervals.

We use the structural similarity (SSIM) loss and the RMSprop optimizer with α = 0.99, as this is the default in the fastMRI repository (Zbontar et al., 2018) and we found no improvement by replacing it with Adam. We do not put a limit on the amount of compute invested into training. We deploy an automated learning rate decay that starts at a learning rate of $10^{-3}$ and decays by a factor of 0.1 if the validation SSIM has not improved by at least $10^{-4}$ for 5 epochs. Once the learning rate drops to $10^{-6}$ we stop the training after 10 additional epochs. Only for the largest training set sizes of 25k and 50k did we find that an additional drop to $10^{-7}$ resulted in further performance gains. We use a batch size of 1. For training set sizes up to 5k we train three models with random seeds and pick the best. For larger training set sizes we only run one seed, since the variance between runs decreased. The experiments were conducted on four NVIDIA A40, four NVIDIA RTX A6000 and four NVIDIA Quadro RTX 6000 GPUs. We measure the time in GPU hours until the best epoch according to the validation loss, resulting in about 4250 GPU hours for the experiments in Fig. 1(c),(d).
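To make the masking step concrete, the sketch below builds a one-dimensional equispaced column mask with a fully sampled center. It is a simplified illustration, not a reproduction of the mask function in the fastMRI repository, which may handle offsets and the exact sampling rate differently. The function name and the choice to start the equispaced pattern at column 0 are assumptions.

```python
def equispaced_mask(num_cols, acceleration=4, center_fraction=0.08):
    """Return a list of 0/1 flags, one per k-space column.

    All columns in a centered block of size center_fraction * num_cols are
    kept, and outside of it every `acceleration`-th column is kept.
    """
    num_center = round(num_cols * center_fraction)
    start = (num_cols - num_center) // 2
    mask = [0] * num_cols
    for col in range(0, num_cols, acceleration):  # equispaced columns
        mask[col] = 1
    for col in range(start, start + num_center):  # fully sampled center
        mask[col] = 1
    return mask


if __name__ == "__main__":
    m = equispaced_mask(320)
    print(f"kept {sum(m)} of {len(m)} columns ({sum(m) / len(m):.1%})")
```

Note that in this simplified version the center block comes on top of the equispaced columns, so the fraction of kept columns is somewhat above 25%.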

A.3 EXPERIMENTAL DETAILS FOR EMPIRICAL SCALING LAWS FOR DENOISING WITH THE SWINIR

In this Section, we give a detailed description of the experimental setup that led to our results for Gaussian denoising with a SwinIR presented in Fig. 2(a),(b) and Section 3. In addition, Fig. 4 shows examples of reconstructions from different models along the performance curve in Fig. 2. It is difficult for the naked eye to notice large differences in the quality of the reconstructions. This indicates that for Gaussian denoising both models already operate in a regime where the improvements to be made are only marginal.

Next, we describe the experimental details. To obtain Fig. 2, we train 4 different network sizes with {3.7, 11.5, 41.8, 128.9} million parameters. We denote the four network sizes as small (S), middle (M), large (L) and huge (H). The training details and network configurations are as follows. The default SwinIR for denoising (Liang et al., 2021) was proposed for a training set size of about 10k images and 11.5M network parameters, and was trained with a batch size of 8 for T = 1280 epochs, where the learning rate is halved at [0.5T, 0.75T, 0.875T, 0.9375T] epochs. We keep the learning rate schedule but adjust the maximal number of epochs T according to the training set size. Table 1 shows the batch size and the maximal number of epochs for every experiment in Fig. 2. We did not optimize over the choice of the batch size but picked the batch size as dictated by the availability of computational resources.

See Liang et al. (2021) for a detailed description of the SwinIR network architecture. We vary the network size by adjusting the number of residual Swin Transformer blocks, the number of Swin Transformers per block, the number of attention heads per Swin Transformer, the number of channels in the input embedding and the width of the fully connected layers in a Swin Transformer. Table 2 contains a summary of the settings. When scaling up the network size, we invested in the parameters that seemed to be most promising in the ablation studies in Liang et al. (2021).
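The learning rate schedule with halvings at fixed fractions of the total number of epochs T can be sketched as below. The base learning rate is not stated in this section, so `base_lr` is a placeholder argument, and the function name is hypothetical.

```python
HALVING_FRACTIONS = (0.5, 0.75, 0.875, 0.9375)


def lr_at_epoch(epoch, total_epochs, base_lr):
    """Learning rate at a given epoch when the rate is halved at
    0.5T, 0.75T, 0.875T and 0.9375T, with T = total_epochs."""
    milestones = [frac * total_epochs for frac in HALVING_FRACTIONS]
    num_halvings = sum(epoch >= m for m in milestones)
    return base_lr * 0.5 ** num_halvings


if __name__ == "__main__":
    T = 1280  # default number of epochs of the original SwinIR setup
    for epoch in (0, 640, 960, 1120, 1200):
        print(epoch, lr_at_epoch(epoch, T, base_lr=1.0))
```

Because the milestones are expressed as fractions of T, changing T for a different training set size keeps the shape of the schedule unchanged.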
The experiments were conducted on four NVIDIA A40 and four NVIDIA RTX A6000 GPUs. We used about 13000 GPU hours for the transformer experiments in Fig. 2. For training the models in parallel on multiple GPUs we utilize the torch.distributed package with the gloo backend, instead of the faster nccl backend, which was unfortunately not available on our hardware at that time.

In Fig. 2 in the main body we compared the performance of the U-Nets and the SwinIRs trained on subsets of ImageNet for image denoising. The models are evaluated on a test set sampled from ImageNet. We observed that while SwinIRs significantly outperform U-Nets, the performance gain from increasing the training set size slows already at moderate training set sizes for both architectures equally. However, there is still a moderate performance gain in scaling the models, and thus we expect the largest SwinIR trained on the largest dataset to outperform the original SwinIR from Liang et al. (2021). Our results in this section show that this is indeed the case.

In Table 3 we evaluate on the standard test sets for Gaussian color image denoising: CBSD68 (Martin et al., 2001), Kodak24 (Franzen, 1999), McMaster (Zhang et al., 2011) and Urban100 (Huang et al., 2015). We observe a significant performance difference between the best U-Net (46.5M parameters, 100k training images) and the transformer-based methods. As expected, our largest SwinIR trained on the largest dataset, SwinIR 100/H, outperforms not only the original SwinIR but also the SCUnet (Zhang et al., 2022), a later SOTA model that has been demonstrated to outperform the original SwinIR. We also depict the gains for the SwinIR from just scaling up the network size and then from scaling up network size and training set size.

While this led to a new SOTA for Gaussian image denoising, note that on the downside training the SwinIR 10/M, which is comparable to the original SwinIR, took about 2 weeks on 4 NVIDIA A40 GPUs, while training the SwinIR 100/H took over 2 months.

C EMPIRICAL SCALING LAWS FOR IMAGE SUPER-RESOLUTION WITH A U-NET

In this Section, we consider the problem of super-resolution, i.e., estimating a high-resolution image from a low-resolution version of the image. This can be viewed as a compressive sensing problem, since we can view the super-resolution problem as reconstructing a signal $x \in \mathbb{R}^n$ from a downsampled version $y = Ax$, where the matrix $A \in \mathbb{R}^{m \times n}$ implements a downsampling operation like bicubic downsampling, or blurring followed by downsampling. As first shown in the pioneering work of Dong et al. (2014), data-driven neural networks trained end-to-end outperform classical model-based approaches (Gu et al., 2012; Michaeli & Irani, 2013; Timofte et al., 2013). Dong et al. (2014) report that for super-resolution with a simple three-layer CNN, the gains from very large training sets do not seem to be as impressive as in high-level vision problems like image classification. Liang et al. (2021) plot the super-resolution performance of the SOTA SwinIR model as a function of the training set size up to 3600 training examples for a fixed network size. When plotting their results on a logarithmic scale, we observe that the performance improvement follows a power law. It is unclear, however, whether this power law slows beyond this relatively small number of images. In this Section, we obtain scaling laws for super-resolution for a U-Net over a wide range of training set and network sizes, similar to those for denoising and compressive sensing in the main body.

Model variants and training. We train the same U-Net model used in Section 3 and described in Appx. A.1, but with only one block per encoder/decoder, as this gave slightly better results than two blocks. The network is trained end-to-end to map a coarse reconstruction, obtained through bicubic upsampling, to the residual between the coarse reconstruction and the high-resolution ground truth image.
We vary the number of channels in the first layer in {8, 16, 32, 64, 128, 192, 256, 320, 448, 564}, which corresponds to {0.007, 0.025, 0.10, 0.40, 1.6, 3.6, 6.4, 10.0, 20.0, 31.2} million network parameters. We do not use any data augmentation, since it is unclear how to account for it in the number of training examples. We use the $\ell_1$-loss and the Adam optimizer with its default settings. For all experiments we find a good initial learning rate with the same annealing strategy as described in Appx. A.1. However, instead of picking the largest learning rate for which the validation loss does not diverge, we pick the second largest, which leads to slightly more stable results. We start the annealing with a learning rate of $10^{-5}$. In the few cases where our heuristic leads to a degenerate training curve, typically due to picking a significantly too small or too large learning rate, starting the annealing with a smaller learning rate of $10^{-6}$ resolves the problem.

We do not put a limit on the amount of compute invested into training. To this end, we deploy an automated learning rate decay that reduces the learning rate by a factor of 0.5 if the validation PSNR has not improved by at least 0.001 for 8 epochs. We stop the training once the validation loss has not improved for two consecutive learning rates. For training sets up to 10000 images we found that a batch size of 1 works best. For larger training sets we use a batch size of 10. For training set sizes up to 10000 we train 3 random seeds and pick the best run. As the variance between runs, relative to the gain in performance between different training set sizes, decreases with increasing training set size, we only run one seed for larger training set sizes. The experiments were conducted on four NVIDIA A40, four NVIDIA RTX A6000 and four NVIDIA Quadro RTX 6000 GPUs. We measure the time in GPU hours until the best epoch according to the validation loss, resulting in about 1500 GPU hours for the experiments in Fig. 6.
A linear power law with a scaling coefficient α = 0.0075 holds roughly up to training set sizes of about 30k images, and for training set sizes starting from 60k this slows to a linear power law with a significantly smaller scaling coefficient α = 0.0029. With this slowed scaling law a training set size of 1.6B images would be required to increase performance by another 1dB (assuming the relation $32.05\,N^{0.0029}$ persists, whereas the scaling is likely to slow down even further). While slowing already at a few tens of thousands of training images, the scaling laws for super-resolution do not slow as early as those for denoising and compressive sensing (see Fig. 1). This could be partially due to training on image patches of size 128 × 128 as opposed to size 256 × 256 used for denoising.
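The extrapolation behind such estimates amounts to inverting the fitted power law. The sketch below assumes the relation PSNR(N) = 32.05 N^0.0029 quoted above and solves for the training set size that yields a given gain over a chosen starting point. The function names and the starting size in the usage example are illustrative. Because the exponent is tiny, the result is very sensitive to the starting point and to rounding of the fitted coefficients, so it should not be expected to reproduce the figure quoted in the text exactly.

```python
def psnr(n, beta=32.05, alpha=0.0029):
    """Power-law fit PSNR(N) = beta * N**alpha (coefficients from the text)."""
    return beta * n ** alpha


def n_for_gain(n_start, gain_db, beta=32.05, alpha=0.0029):
    """Training set size needed to gain `gain_db` dB over PSNR(n_start),
    assuming the same power law keeps holding."""
    target = psnr(n_start, beta, alpha) + gain_db
    return (target / beta) ** (1.0 / alpha)


if __name__ == "__main__":
    n_start = 100_000  # illustrative starting point
    n_needed = n_for_gain(n_start, gain_db=1.0)
    print(f"PSNR at N={n_start}: {psnr(n_start):.2f} dB")
    print(f"N needed for +1 dB: {n_needed:.3g}")
```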

D.1 EMPIRICAL SCALING LAWS FOR DENOISING WITH FIXED NOISE

Our main results for denoising with a U-Net trained end-to-end discussed in Section 3 follow a setup in which the noise per training example is re-sampled in every training epoch. We choose this setup since it makes the best use of the available clean images. However, fixing the noise is also interesting since it is closer to a denoising setup in which the noise statistics are unknown, which is the case in some real-world noise removal problems. In such a problem, we would be given pairs of noisy and clean images, and could not synthesize new noisy images from the clean images. In this section, we follow the same experimental setup as in Appx. A.1 to simulate the performance of a U-Net for denoising with fixed noise for up to 100k training images (see Fig. 7). Compared to re-sampling the noise, we observe a drop in performance of about 0.3dB for small training set sizes. The performance difference at 10k images is reduced to 0.2dB, resulting in a slightly steeper scaling law for moderate training set sizes. However, at around 10k training images the scaling of the performance of training with fixed noise also starts to flatten as it approaches the performance of re-sampling the noise during training. This indicates that if the noise statistics are unknown, more data is required to achieve the same performance as when they are known. However, in both cases the scaling with training set size already slows down at moderate training set sizes.
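The difference between the two setups can be illustrated with a toy data-loading function. This is a sketch in plain Python on a flat list of pixel values, not the data pipeline used for the experiments. The function name and the seeding scheme (a string built from the example index and, when re-sampling, the epoch) are assumptions made for illustration, and the default sigma = 25 is the noise level of the main denoising experiments.

```python
import random


def noisy_example(clean, index, epoch, sigma=25.0, resample=True):
    """Return clean + Gaussian noise for training example `index`.

    With resample=True a new noise realization is drawn in every epoch.
    With resample=False the realization depends only on the example index,
    which mimics a fixed dataset of noisy/clean pairs.
    """
    tag = epoch if resample else "fixed"
    rng = random.Random(f"{index}-{tag}")  # string seeds are deterministic
    return [x + rng.gauss(0.0, sigma) for x in clean]


if __name__ == "__main__":
    clean = [128.0] * 4
    print(noisy_example(clean, index=0, epoch=0, resample=False))
    print(noisy_example(clean, index=0, epoch=1, resample=False))  # same
    print(noisy_example(clean, index=0, epoch=1, resample=True))   # new draw
```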

D.2 EMPIRICAL SCALING LAWS FOR DENOISING WITH A SMALLER NOISE LEVEL

Our results for Gaussian denoising in Section 3 are for a fixed noise level of $\sigma_z = 25$. In this Section, we repeat the experiments for Gaussian denoising with a U-Net described in Appx. A.1 with a smaller noise level of $\sigma_z = 15$, in order to see how the scaling laws change. The results for both noise levels are depicted in Fig. 8. We observe an improvement of about 2.3dB in PSNR, which is expected since the irreducible error decreases for smaller noise levels. We also observe that the scaling coefficient for the smaller noise level $\sigma_z = 15$ (i.e., α = 0.0026) is slightly larger than that for the larger noise level $\sigma_z = 25$ (i.e., α = 0.0019). This coincides with the qualitative behavior of the curves for subspace denoising in Figure 3. Apart from that, the curves are qualitatively similar, in that an initially steep power law is replaced by a slower one at around 6000 training images. We see that in the regime of large training set sizes larger patches perform better than more but smaller patches.

D.3 EMPIRICAL SCALING LAWS FOR DENOISING WITH A SMALLER PATCH SIZE

Our results for Gaussian denoising in Section 3 with a U-Net were obtained for a constant training patch size of 256 × 256 pixels across all network and training set sizes. In this Section, we repeat the experiments for Gaussian denoising with a U-Net described in Appx. A.1 with a smaller patch size of 128 × 128 pixels. The results for both patch sizes are depicted in Fig. 9. We observe that in the regime of large training set sizes, which we are primarily interested in, training on N patches of size 256 × 256 is more beneficial than training on 4N patches of size 128 × 128. We therefore focus on patch size 256 × 256 in the main body of this work.

E UNDERSTANDING SCALING LAWS FOR DENOISING THEORETICALLY - SUPPLEMENTARY RESULTS

In this section, we provide additional details on the statements in Section 5 on understanding scaling laws for denoising theoretically by studying a linear subspace denoising problem, and provide additional numerical results. Recall that we consider a linear estimator of the form $f_W(y) = Wy$, and measure performance in terms of the expected mean-squared reconstruction error (normalized by the latent signal dimension $d$):

$$R(W) = \frac{1}{d}\,\mathbb{E}\left[\|Wy - x\|_2^2\right] = \frac{1}{d}\|(W - I)U\|_F^2 + \frac{\sigma_z^2}{d}\|W\|_F^2. \tag{1}$$

Above, the expectation is over the joint distribution of $(x, y)$, and the second equality follows from using that $x = Uc$, where $c \sim \mathcal{N}(0, I)$ is Gaussian, and $y = x + z$, where the noise $z \sim \mathcal{N}(0, \sigma_z^2 I)$ is Gaussian.

The optimal linear estimator. The optimal linear estimator (i.e., the estimator that minimizes the risk defined in equation (1)) is given by $W^* = \frac{1}{1+\sigma_z^2} UU^T$. This follows from taking the gradient of the risk (1), setting it to zero, and solving for $W$. The estimator projects the data onto the subspace and shrinks towards zero, depending on the noise variance. The associated risk is $R(W^*) = \sigma_z^2/(1 + \sigma_z^2)$.

[Figure: curves as a function of the training set size N for (α = -1.5, d = 100), (α = -1.2, d = 50) and (α = -0.9, d = 10).]

[...] singular values of $YY^T$. We define $\hat{U}_\perp \in \mathbb{R}^{n \times (n-d)}$ as the orthogonal complement of $\hat{U}$. Starting from the risk expression given in equation (1) we obtain

$$\begin{aligned}
R(W_{PCA}) &= \frac{1}{d}\left\|(\tau \hat{U}\hat{U}^T - I)U\right\|_F^2 + \frac{1}{d}\tau^2\sigma_z^2\left\|\hat{U}\hat{U}^T\right\|_F^2 \\
&= \frac{1}{d}\left\|\left((\tau - 1)\hat{U}\hat{U}^T - \hat{U}_\perp\hat{U}_\perp^T\right)U\right\|_F^2 + \tau^2\sigma_z^2 \\
&= \frac{1}{d}(\tau - 1)^2\left\|\hat{U}^T U\right\|_F^2 + \frac{1}{d}\left\|\hat{U}_\perp^T U\right\|_F^2 + \tau^2\sigma_z^2 \\
&= \frac{1}{d}(\tau - 1)^2\left(d - \left\|\hat{U}_\perp^T U\right\|_F^2\right) + \frac{1}{d}\left\|\hat{U}_\perp^T U\right\|_F^2 + \tau^2\sigma_z^2 \\
&= \left(1 - (\tau - 1)^2\right)\frac{1}{d}\left\|\hat{U}_\perp^T U\right\|_F^2 + (\tau - 1)^2 + \sigma_z^2\tau^2 \\
&= \frac{1 + 2\sigma_z^2}{(1 + \sigma_z^2)^2}\,\frac{1}{d}\left\|\hat{U}_\perp^T U\right\|_F^2 + \frac{\sigma_z^2}{1 + \sigma_z^2} \\
&\overset{(i)}{\le} \frac{1 + 2\sigma_z^2}{(1 + \sigma_z^2)^2}\left\|\hat{U}_\perp^T U\right\|^2 + \frac{\sigma_z^2}{1 + \sigma_z^2} \\
&\overset{(ii)}{\le} c\,\frac{1 + 2\sigma_z^2}{(1 + \sigma_z^2)^2}\,\frac{(d + n\sigma_z^2)\log(2n)}{N} + \frac{\sigma_z^2}{1 + \sigma_z^2} \\
&\le c\,\frac{(d + n\sigma_z^2)\log(n)}{N} + \frac{\sigma_z^2}{1 + \sigma_z^2}.
\end{aligned}$$
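Here, inequality (i) uses that $\hat{U}_\perp^T U$ has $d$ columns, so that its squared Frobenius norm is at most $d$ times its squared spectral norm. Inequality (ii) follows from Section H.1, equation (34), and holds in the regime $(d + n\sigma_z^2)\log(n) \le N$, for some constant $c$ and with probability at least $1 - n^{-10} - 3e^{-d} - e^{-n}$. This concludes the proof.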
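The scalar identities used in the last equalities of the derivation can be checked numerically. The sketch below assumes the shrinkage factor τ = 1/(1 + σ_z²), which is the value consistent with the final equality, and treats the normalized subspace error e = (1/d)‖Û⊥ᵀU‖_F² as a free input in [0, 1]. The function names are illustrative.

```python
import math


def optimal_risk(sigma2):
    """Risk of the optimal linear estimator, sigma_z^2 / (1 + sigma_z^2)."""
    return sigma2 / (1.0 + sigma2)


def pca_risk(subspace_error, sigma2):
    """Risk of tau * U_hat U_hat^T as a function of the normalized subspace
    error e = (1/d) * ||U_hat_perp^T U||_F^2, assuming tau = 1/(1+sigma_z^2)."""
    tau = 1.0 / (1.0 + sigma2)
    return ((1.0 - (tau - 1.0) ** 2) * subspace_error
            + (tau - 1.0) ** 2 + sigma2 * tau ** 2)


def slope(sigma2):
    """Coefficient (1 + 2 sigma_z^2) / (1 + sigma_z^2)^2 of the error term."""
    return (1.0 + 2.0 * sigma2) / (1.0 + sigma2) ** 2


if __name__ == "__main__":
    for sigma2 in (0.01, 0.1, 1.0):
        # With a perfectly estimated subspace the PCA risk is the optimal risk.
        assert math.isclose(pca_risk(0.0, sigma2), optimal_risk(sigma2))
        print(sigma2, optimal_risk(sigma2), slope(sigma2))
```

The check makes explicit that the risk splits into the error floor R(W*) plus a term proportional to the subspace estimation error, and only the latter shrinks with more training data.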

G PROOF OF THEOREM 2: RISK BOUND FOR EARLY STOPPED EMPIRICAL RISK MINIMIZATION

This section contains the proof of Theorem 2. The theorem characterizes the performance of the learned, linear estimator $W_k$ that is obtained by applying $k$ iterations of gradient descent with stepsize $\eta$ starting at $W_0 = 0$ to the loss $L(W)$ in equation (2). We start by deriving a closed form expression for the estimator $W_k$. The gradient of the loss is $\nabla_W L(W) = (WY - X)Y^T$, and thus the iterations of gradient descent are

$$W_{k+1} = W_k - \eta(W_k Y - X)Y^T = W_k(I - \eta YY^T) + \eta XY^T.$$

Let $Y = U_y\Sigma_y V_y^T \in \mathbb{R}^{n \times N}$ and $X = U_x\Sigma_x V_x^T \in \mathbb{R}^{n \times N}$ be the singular value decompositions of $Y$ and $X$ respectively and assume that the singular values are non-zero and descending, i.e., $\sigma_{y,1} \ge \sigma_{y,2} \ge \ldots$. We have, with $W_0 = 0$ and $k \ge 1$, that

$$\begin{aligned}
W_k &= \eta XY^T\sum_{\ell=0}^{k-1}(I - \eta YY^T)^\ell \\
&= \eta XV_y\Sigma_y U_y^T\sum_{\ell=0}^{k-1}(I - \eta U_y\Sigma_y^2 U_y^T)^\ell \\
&= \eta XV_y\Sigma_y\,\mathrm{diag}\left(\sum_{\ell=0}^{k-1}(1 - \eta\sigma_{y,i}^2)^\ell\right)U_y^T \\
&= XV_y D_k U_y^T,
\end{aligned}$$

where we defined $D_k \in \mathbb{R}^{N \times N}$ as a diagonal matrix with $i$-th diagonal entry given by $(1 - (1 - \eta\sigma_{y,i}^2)^k)/\sigma_{y,i}$ and where we used the geometric series to obtain

$$\sum_{\ell=0}^{k-1}(1 - \eta\sigma_{y,i}^2)^\ell = \frac{1 - (1 - \eta\sigma_{y,i}^2)^k}{\eta\sigma_{y,i}^2}.$$

Note that for $k \to \infty$ and choosing $\eta$ such that $|1 - \eta\sigma_{y,i}^2| < 1$ for all $i$ we get $W_\infty = XV_y\Sigma_y^{-1}U_y^T = XY^\dagger$. Evaluating the risk from equation (1) at the estimator $W_k$ gives

$$R(W_k) = \frac{1}{d}\left\|(XV_y D_k U_y^T - I)U\right\|_F^2 + \frac{\sigma_z^2}{d}\left\|XV_y D_k U_y^T\right\|_F^2. \tag{6}$$

To shorten notation we define

$$\gamma := \frac{(d + n\sigma_z^2)\log(n)}{N}, \tag{7}$$

$$\psi := \frac{n\sigma_z^2\log(n)}{N}. \tag{8}$$

We next provide bounds for the two terms on the right-hand side of equation (6), proven later in this section.

Bound on the first term in equation (6): In Section G.1 we show that provided $N\log(N) \le n$, $9d \le N$ and $(d + n\sigma_z^2)\log(n) \le N$, for some constant $c$, with probability at least $1 - 2e^{-N/18} - 3n^{-10} - 3e^{-d} - e^{-n} - e^{-N} - 2e^{-n/2}$, the following bound holds:

$$\begin{aligned}
\frac{1}{d}\left\|(XV_y D_k U_y^T - I)U\right\|_F^2 \le{}& \frac{c}{d}\left(\sigma_z^2 n + \gamma\sigma_{x,\max}^2\right)\sum_{i=1}^{d}\frac{1}{\sigma_{y,i}^2}\left(1 - (1 - \eta\sigma_{y,i}^2)^k\right)^2 + \frac{c}{d}\sum_{i=1}^{d}(1 - \eta\sigma_{y,i}^2)^{2k} \\
&+ \frac{c}{d}\psi\gamma\sum_{i=d+1}^{N}\frac{\sigma_{x,\max}^2}{\sigma_{y,i}^2}\left(1 - (1 - \eta\sigma_{y,i}^2)^k\right)^2 + c\gamma.
\end{aligned} \tag{9}$$
Bound on the second term in equation (6): In Section G.2 we show

$$\frac{\sigma_z^2}{d}\left\|XV_y D_k U_y^T\right\|_F^2 \le \frac{\sigma_z^2}{d}\sum_{i=1}^{N}\frac{\sigma_{x,\max}^2}{\sigma_{y,i}^2}\left(1 - (1 - \eta\sigma_{y,i}^2)^k\right)^2. \tag{10}$$

With equations (9) and (10) in place and by splitting up the sum in (10) we can bound the right-hand side of equation (6) as

$$\begin{aligned}
R(W_k) \le{}& \frac{1}{d}\left(c\sigma_z^2 n + c\gamma\sigma_{x,\max}^2 + \sigma_z^2\sigma_{x,\max}^2\right)\sum_{i=1}^{d}\frac{1}{\sigma_{y,i}^2}\left(1 - (1 - \eta\sigma_{y,i}^2)^k\right)^2 + \frac{c}{d}\sum_{i=1}^{d}(1 - \eta\sigma_{y,i}^2)^{2k} + c\gamma \\
&+ \frac{c}{d}\left(\psi\gamma + \sigma_z^2\right)\sum_{i=d+1}^{N}\frac{\sigma_{x,\max}^2}{\sigma_{y,i}^2}\left(1 - (1 - \eta\sigma_{y,i}^2)^k\right)^2.
\end{aligned} \tag{11}$$

Since the singular values of $Y$ are in descending order $\sigma_{y,1} \ge \sigma_{y,2} \ge \ldots$, this is, for any iteration $k$, bounded as

$$\begin{aligned}
R(W_k) \le{}& \left(c\sigma_z^2 n + c\gamma\sigma_{x,\max}^2 + \sigma_z^2\sigma_{x,\max}^2\right)\frac{1}{\sigma_{y,d}^2} + c(1 - \eta\sigma_{y,d}^2)^{2k} + c\gamma \\
&+ \frac{c}{d}\left(\psi\gamma + \sigma_z^2\right)\sigma_{x,\max}^2\sum_{i=d+1}^{N}\frac{1}{\sigma_{y,i}^2}\left(1 - (1 - \eta\sigma_{y,i}^2)^k\right)^2.
\end{aligned} \tag{12}$$

This is a good bound if we are at an iteration sufficiently large so that $(1 - \eta\sigma_{y,1}^2)^k$ is small. In (12) we have the term $(1 - \eta\sigma_{y,d}^2)^{2k}$ that is decreasing in the number of gradient descent steps $k$ and also decreasing in the stepsize $\eta$ as long as $\eta\sigma_{y,i}^2 \le 1$ for $i = 1, \ldots, d$. Further, we have the term $(1 - (1 - \eta\sigma_{y,i}^2)^k)^2$ that is increasing in $k$ and also increasing in $\eta$. The first term corresponds to the signal that we want to fit sufficiently well, whereas the second term corresponds to the noise, of which we want to fit as little as possible. Hence, there exist optimal choices for $k, \eta$ that trade off the sum of the two terms. In our setup the $d$ leading squared singular values of $Y$ corresponding to the signal are large and concentrate around $N(1 + \sigma_z^2)$, while the remaining squared singular values are small and concentrate around $N\sigma_z^2$. Hence, we can apply a single step of gradient descent, $k = 1$, to already fit a large portion of the signal, while minimizing the portion of the noise that is fitted. For that we choose the stepsize $\eta$ as large as possible such that $\eta\sigma_{y,i}^2 \le 1$ for $i = 1, \ldots, d$ still holds.
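Both the closed form of the iterates and the trade-off between the signal and noise terms can be checked in a scalar toy version of the problem, where $Y$ and $X$ are single numbers $y$ and $x$, so that $\sigma_y = y$ and $D_k$ reduces to one entry. The sketch below is an illustration under this simplification, and the function names are hypothetical.

```python
import math


def gd_iterate(x, y, eta, k):
    """k gradient descent steps on L(w) = 0.5 * (w*y - x)**2, from w = 0."""
    w = 0.0
    for _ in range(k):
        w = w - eta * (w * y - x) * y
    return w


def closed_form(x, y, eta, k):
    """Scalar version of W_k = X V_y D_k U_y^T with
    D_k = (1 - (1 - eta * y**2)**k) / y."""
    return x * (1.0 - (1.0 - eta * y * y) ** k) / y


def signal_term(eta, s2, k):
    """(1 - eta * s2)**(2k): unfitted signal, decreasing in k."""
    return (1.0 - eta * s2) ** (2 * k)


def noise_term(eta, s2, k):
    """(1 - (1 - eta * s2)**k)**2 / s2: fitted noise, increasing in k."""
    return (1.0 - (1.0 - eta * s2) ** k) ** 2 / s2


if __name__ == "__main__":
    assert math.isclose(gd_iterate(2.0, 3.0, 0.05, 7),
                        closed_form(2.0, 3.0, 0.05, 7))
    eta, s2_signal, s2_noise = 0.01, 50.0, 5.0  # illustrative values
    for k in (1, 2, 4, 8):
        print(k, signal_term(eta, s2_signal, k), noise_term(eta, s2_noise, k))
```

A large gap between the signal and noise singular values is what allows a single step with a large stepsize to fit most of the signal while fitting little of the noise.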
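Next, suppose the following events hold:

\begin{align*}
E_1 &= \left\{\sigma_{y,d+1}^2 \le N\left(\sigma_z^2 + \epsilon(1 + \sigma_z^2)\right)\right\}, \\
E_2 &= \left\{\sigma_{y,d}^2 \ge N(1 + \sigma_z^2)(1 - \epsilon)\right\}, \\
E_3 &= \left\{\sigma_{x,\max}^2 \le 4N\right\}, \tag{15} \\
E_4 &= \left\{\sigma_{y,1}^2 \le N(1 + \epsilon)(1 + \sigma_z^2)\right\}.
\end{align*}

In Section H.4, we show that $P[E_3] \ge 1 - 2e^{-N/8}$. In Section H.4, we also show that, provided that $N \ge 3C\epsilon^{-2}(d + \sigma_z^2 n)\log n$, for some constant $C$ and $\epsilon \in (0, 1)$,

$$P[E_1],\, P[E_2],\, P[E_4] \ge 1 - e^{-d} - e^{-n} - n^{-9}.$$

We next bound the terms in equation (12). As discussed, we set $k = 1$ and the stepsize as large as possible, i.e., $\eta = 1/\left(N(1 + \epsilon)(1 + \sigma_z^2)\right) \le 1/\sigma_{y,1}^2$, which holds on event $E_4$. Finally, on event $E_2$ we obtain

$$c(1 - \eta\sigma_{y,d}^2)^{2k} \le c\left(1 - \eta N(1 + \sigma_z^2)(1 - \epsilon)\right)^2 = c\left(1 - \frac{1 - \epsilon}{1 + \epsilon}\right)^2 \le c\epsilon^2.$$

Next we bound the sum in equation (12). Towards this goal, we upper bound each term in the sum with its linear approximation at the origin. We compute the derivative at the origin as

$$\lim_{q \to 0}\frac{\partial}{\partial q}\,\frac{1}{q}\left(1 - (1 - \eta q)^k\right)^2 = (\eta k)^2.$$

Thus, we have

$$\sum_{i=d+1}^{N}\frac{1}{\sigma_{y,i}^2}\left(1 - (1 - \eta\sigma_{y,i}^2)^k\right)^2 \le \sum_{i=d+1}^{N}k^2\eta^2\sigma_{y,i}^2 \le Nk^2\eta^2\sigma_{y,d+1}^2 \le \frac{\sigma_z^2 + \epsilon + \epsilon\sigma_z^2}{(1 + \epsilon)^2(1 + \sigma_z^2)^2} \le c(\sigma_z^2 + 2\epsilon),$$

on the event $E_1$ and for $\sigma_z^2 \le 1$, $k = 1$ and $\eta = 1/\left(N(1 + \epsilon)(1 + \sigma_z^2)\right)$. Putting this together and on the events $E_2$, $E_3$ we get the bound

$$\begin{aligned}
R(W_k) &\le 8\frac{\sigma_z^2}{1 + \sigma_z^2} + c\gamma + c(1 - \eta\sigma_{y,d}^2)^{2k} + \frac{c}{d}\left(\psi\gamma + \sigma_z^2\right)N\sum_{i=d+1}^{N}\frac{1}{\sigma_{y,i}^2}\left(1 - (1 - \eta\sigma_{y,i}^2)^k\right)^2 \\
&\le 8\frac{\sigma_z^2}{1 + \sigma_z^2} + c\gamma + c\epsilon^2 + \frac{cN}{d}\left(\frac{\sigma_z^2 n\log n}{N}\cdot\frac{(d + n\sigma_z^2)\log n}{N} + \sigma_z^2\right)(\sigma_z^2 + 2\epsilon) \\
&\le 8R(W^*) + c\frac{(d + n\sigma_z^2)\log(n)}{N} + c\epsilon^2 + c\left(\frac{\left((d + n\sigma_z^2)\log n\right)^2}{dN} + \sigma_z^2\frac{N}{d}(\sigma_z^2 + 2\epsilon)\right).
\end{aligned} \tag{19}$$

The bound holds provided $3C\epsilon^{-2}(d + \sigma_z^2 n)\log n \le N$ and $N\log(N) \le n$, for some constants $c, C$ and with probability at least $1 - 2e^{-N/8} - 2e^{-N/18} - 5n^{-9} - 5e^{-d} - 2e^{-n} - e^{-N} - 2e^{-n/2}$. To maximize the benefit from early stopping we set $\epsilon$ as small as possible with respect to the condition $3C\epsilon^{-2}(d + \sigma_z^2 n)\log n \le N$, i.e.,

$$\epsilon = \sqrt{\frac{3C(d + \sigma_z^2 n)\log n}{N}}.$$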
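With that, equation (19) becomes

$$\begin{aligned}
R(W_k) &\le 8R(W^*) + c\frac{(d + n\sigma_z^2)\log(n)}{N} + c\left(\frac{\left((d + n\sigma_z^2)\log n\right)^2}{dN} + \sigma_z^2\frac{N}{d}\left(\sigma_z^2 + \sqrt{\frac{(d + \sigma_z^2 n)\log n}{N}}\right)\right) \\
&= 8R(W^*) + c\gamma + c\left(\gamma\left(1 + \frac{n\sigma_z^2}{d}\right)\log n + \sigma_z^2\frac{N}{d}\left(\sigma_z^2 + \sqrt{\gamma}\right)\right) \\
&\le 8R(W^*) + c\gamma\left(1 + \frac{n\sigma_z^2}{d}\right)\log n + c\sigma_z^2\frac{N}{d}\left(\sigma_z^2 + \sqrt{\gamma}\right).
\end{aligned}$$

We now consider the regime where $N \le \xi\, d/\sigma_z^2$, for a numerical constant $\xi$, to simplify the statement further. For this regime, we have

$$R(W_k) \le R(W^*)(8 + 2\xi) + c\gamma\left(1 + \frac{n\sigma_z^2}{d}\right)\log n + c\xi\sqrt{\gamma}.$$

This concludes the proof of Theorem 2.

G.1 PROOF OF EQUATION (9)

In this Section, we derive a bound for $\frac{1}{d}\left\|(XV_y D_k U_y^T - I)U\right\|_F^2$, the first term in (6). To this end, we introduce further notation. Recall that $Y = U_y\Sigma_y V_y^T \in \mathbb{R}^{n \times N}$ and $X = U_x\Sigma_x V_x^T \in \mathbb{R}^{n \times N}$ are the SVDs of $Y$ and $X$ respectively. All derivations below hold regardless of whether $N \le n$ or $N > n$. For concreteness, we present them for $N \le n$. Thus, $U_y \in \mathbb{R}^{n \times N}$, $\Sigma_y \in \mathbb{R}^{N \times N}$, $V_y \in \mathbb{R}^{N \times N}$ and $U_x \in \mathbb{R}^{n \times d}$, $\Sigma_x \in \mathbb{R}^{d \times d}$, $V_x \in \mathbb{R}^{N \times d}$. Let $U_{y1}, U_{x1} \in \mathbb{R}^{n \times d}$ be the $d$ leading left singular vectors of $Y$ and $X$ (note that $U_{x1} = U_x$), and let $U_{y2}, U_{x2} \in \mathbb{R}^{n \times (n-d)}$ be their orthonormal complements. Let $\tilde{U}_y = [U_{y1}\ U_{y2}] \in \mathbb{R}^{n \times n}$ and $\tilde{U}_x = [U_{x1}\ U_{x2}] \in \mathbb{R}^{n \times n}$. Analogous definitions can be made for the leading right singular vectors of $Y$ and $X$.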

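Recall that $D_k = \mathrm{diag}\left(\ldots, (1 - (1 - \eta\sigma_{y,i}^2)^k)/\sigma_{y,i}, \ldots\right) \in \mathbb{R}^{N \times N}$. Let $\tilde{D}_k = [D_k\ 0] \in \mathbb{R}^{N \times n}$ and define $\tilde{D}_{k1} \in \mathbb{R}^{d \times d}$ and $\tilde{D}_{k2} \in \mathbb{R}^{(N-d) \times (n-d)}$ such that

$$\tilde{D}_k = \begin{bmatrix} \tilde{D}_{k1} & 0 \\ 0 & \tilde{D}_{k2} \end{bmatrix}. \tag{22}$$

With these definitions in place we can write

$$\begin{aligned}
\left\|(XV_y D_k U_y^T - I)U\right\|_F &= \left\|\tilde{U}_y^T(XV_y\tilde{D}_k\tilde{U}_y^T - I)\tilde{U}_y\tilde{U}_y^T U\right\|_F = \left\|(\tilde{U}_y^T XV_y\tilde{D}_k - I)\tilde{U}_y^T U\right\|_F \\
&= \left\|\left(\begin{bmatrix} U_{y1}^T XV_{y1}\tilde{D}_{k1} & U_{y1}^T XV_{y2}\tilde{D}_{k2} \\ U_{y2}^T XV_{y1}\tilde{D}_{k1} & U_{y2}^T XV_{y2}\tilde{D}_{k2} \end{bmatrix} - \begin{bmatrix} I & 0 \\ 0 & I \end{bmatrix}\right)\begin{bmatrix} U_{y1}^T U \\ U_{y2}^T U \end{bmatrix}\right\|_F \\
&\le \left\|(U_{y1}^T XV_{y1}\tilde{D}_{k1} - I)U_{y1}^T U\right\|_F + \left\|U_{y2}^T XV_{y1}\tilde{D}_{k1}U_{y1}^T U\right\|_F + \left\|\left(\tilde{U}_y^T XV_{y2}\tilde{D}_{k2} - \begin{bmatrix} 0 \\ I \end{bmatrix}\right)U_{y2}^T U\right\|_F.
\end{aligned} \tag{23}$$

The first term in (23) is bounded in the regime $n \ge N$, for some constant $c$ and with probability at least $1 - 2e^{-n/2}$, by

$$\left\|(U_{y1}^T XV_{y1}\tilde{D}_{k1} - I)U_{y1}^T U\right\|_F \le c\sigma_z\sqrt{n}\sqrt{\sum_{i=1}^{d}\frac{1}{\sigma_{y,i}^2}\left(1 - (1 - \eta\sigma_{y,i}^2)^k\right)^2} + \sqrt{\sum_{i=1}^{d}(1 - \eta\sigma_{y,i}^2)^{2k}}. \tag{24}$$

See Section G.1.1 for a proof. The second term in (23) is bounded for some constant $c$, with probability at least $1 - 2n^{-10} - 2e^{-d} - e^{-n}$ and in the regime $(d + n\sigma_z^2)\log(n) \le N$, by

$$\left\|U_{y2}^T XV_{y1}\tilde{D}_{k1}U_{y1}^T U\right\|_F \le c\sqrt{\gamma}\sqrt{\sum_{i=1}^{d}\frac{\sigma_{x,\max}^2}{\sigma_{y,i}^2}\left(1 - (1 - \eta\sigma_{y,i}^2)^k\right)^2}. \tag{25}$$

See Section G.1.2 for a proof. The third term in (23) is bounded for some constant $c$, with probability at least $1 - 2e^{-N/18} - 3n^{-10} - 3e^{-d} - e^{-n} - e^{-N}$ and in the regime $(d + n\sigma_z^2)\log(n) \le N$, $N\log(N) \le n$ and $9d \le N$, by

$$\left\|\left(\tilde{U}_y^T XV_{y2}\tilde{D}_{k2} - \begin{bmatrix} 0 \\ I \end{bmatrix}\right)U_{y2}^T U\right\|_F \le c\sqrt{\psi\gamma\sum_{i=d+1}^{N}\frac{\sigma_{x,\max}^2}{\sigma_{y,i}^2}\left(1 - (1 - \eta\sigma_{y,i}^2)^k\right)^2} + c\sqrt{d\gamma}. \tag{26}$$

See Section G.1.3 for a proof. Combining those results we can bound equation (23) in the regime $N\log(N) \le n$, $9d \le N$ and $(d + n\sigma_z^2)\log(n) \le N$, for some constant $c$ and with probability at least $1 - 2e^{-N/18} - 3n^{-10} - 3e^{-d} - e^{-n} - e^{-N} - 2e^{-n/2}$, as

$$\begin{aligned}
\frac{1}{d}\left\|(XV_y D_k U_y^T - I)U\right\|_F^2 \le{}& \left(\frac{c\sigma_z^2 n}{d} + \frac{c}{d}\sigma_{x,\max}^2\gamma\right)\sum_{i=1}^{d}\frac{1}{\sigma_{y,i}^2}\left(1 - (1 - \eta\sigma_{y,i}^2)^k\right)^2 + \frac{c}{d}\sum_{i=1}^{d}(1 - \eta\sigma_{y,i}^2)^{2k} \\
&+ \frac{c}{d}\psi\gamma\sum_{i=d+1}^{N}\frac{\sigma_{x,\max}^2}{\sigma_{y,i}^2}\left(1 - (1 - \eta\sigma_{y,i}^2)^k\right)^2 + c\gamma.
\end{aligned}$$

This concludes the proof of equation (9).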

G.1.1 PROOF OF EQUATION (24)

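In the regime $n \ge N$, for some constant $c$ and with probability at least $1 - 2e^{-n/2}$, the first term in (23) is bounded by

$$\begin{aligned}
\left\|(U_{y1}^T XV_{y1}\tilde{D}_{k1} - I)U_{y1}^T U\right\|_F &\overset{(i)}{\le} \left\|U_{y1}^T U_{x1}\Sigma_{x1}V_{x1}^T V_{y1}\tilde{D}_{k1} - I\right\|_F \\
&= \left\|U_{y1}^T U_{x1}\Sigma_{x1}V_{x1}^T V_{y1}\tilde{D}_{k1} - \Sigma_{y1}\tilde{D}_{k1} + \Sigma_{y1}\tilde{D}_{k1} - I\right\|_F \\
&\le \left\|\left(U_{y1}^T U_{x1}\Sigma_{x1}V_{x1}^T V_{y1} - \Sigma_{y1}\right)\tilde{D}_{k1}\right\|_F + \left\|\Sigma_{y1}\tilde{D}_{k1} - I\right\|_F \\
&\overset{(ii)}{\le} c\sigma_z\sqrt{n}\sqrt{\sum_{i=1}^{d}\frac{1}{\sigma_{y,i}^2}\left(1 - (1 - \eta\sigma_{y,i}^2)^k\right)^2} + \sqrt{\sum_{i=1}^{d}(1 - \eta\sigma_{y,i}^2)^{2k}}.
\end{aligned}$$

Inequality (i) uses $\|U_{y1}^T U\| \le 1$. To obtain inequality (ii), we used that $\tilde{D}_{k1} = \mathrm{diag}\left(\ldots, (1 - (1 - \eta\sigma_{y,i}^2)^k)/\sigma_{y,i}, \ldots\right) \in \mathbb{R}^{d \times d}$ and that

$$\left\|U_{y1}^T U_{x1}\Sigma_{x1}V_{x1}^T V_{y1} - \Sigma_{y1}\right\| \le \left\|U_{x1}\Sigma_{x1}V_{x1}^T - U_{y1}\Sigma_{y1}V_{y1}^T\right\| \le \left\|U_x\Sigma_x V_x^T - U_y\Sigma_y V_y^T\right\| = \|Z\| \le c\sigma_z\sqrt{n}.$$

Here, the last inequality holds in the regime $n \ge N$, for some constant $c$ and with probability at least $1 - 2e^{-n/2}$, and follows from Section H.4, equation (64).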

G.1.2 PROOF OF EQUATION (25)

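For some constant $c$, with probability at least $1 - 2n^{-10} - 2e^{-d} - e^{-n}$ and in the regime $(d + n\sigma_z^2)\log(n) \le N$, the second term in (23) is bounded by

$$\begin{aligned}
\left\|U_{y2}^T XV_{y1}\tilde{D}_{k1}U_{y1}^T U\right\|_F &\le \left\|U_{y2}^T U_{x1}\right\|\left\|V_{x1}^T V_{y1}\right\|\left\|U_{y1}^T U\right\|\left\|\Sigma_{x1}\right\|\left\|\tilde{D}_{k1}\right\|_F \\
&\overset{(i)}{\le} \left\|U_{y2}^T U_{x1}\right\|\left\|\Sigma_{x1}\right\|\left\|\tilde{D}_{k1}\right\|_F \\
&\overset{(ii)}{\le} c\sqrt{\gamma}\sqrt{\sum_{i=1}^{d}\frac{\sigma_{x,\max}^2}{\sigma_{y,i}^2}\left(1 - (1 - \eta\sigma_{y,i}^2)^k\right)^2}.
\end{aligned}$$

Here, inequality (i) follows from $\|V_{x1}^T V_{y1}\| \le 1$ and $\|U_{y1}^T U\| \le 1$. For inequality (ii) we used that $\tilde{D}_{k1} = \mathrm{diag}\left(\ldots, (1 - (1 - \eta\sigma_{y,i}^2)^k)/\sigma_{y,i}, \ldots\right) \in \mathbb{R}^{d \times d}$ and the bound on $\|U_{y2}^T U_{x1}\|$ from Section H.1, equation (35), that holds in the regime $(d + n\sigma_z^2)\log(n) \le N$, for some constant $c$ and with probability at least $1 - 2n^{-10} - 2e^{-d} - e^{-n}$.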

G.1.3 PROOF OF EQUATION (26)

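For some constant $c$, with probability at least $1 - 2e^{-N/18} - 3n^{-10} - 3e^{-d} - e^{-n} - e^{-N}$ and in the regime $(d + n\sigma_z^2)\log(n) \le N$ and $9d \le N$, we obtain

$$\begin{aligned}
\left\|\left(\tilde{U}_y^T XV_{y2}\tilde{D}_{k2} - \begin{bmatrix} 0 \\ I \end{bmatrix}\right)U_{y2}^T U\right\|_F &= \left\|\begin{bmatrix} U_{y1}^T XV_{y2}\tilde{D}_{k2}U_{y2}^T U \\ U_{y2}^T XV_{y2}\tilde{D}_{k2}U_{y2}^T U - U_{y2}^T U \end{bmatrix}\right\|_F \\
&\le \left\|U_{y1}^T U_{x1}\Sigma_{x1}V_{x1}^T V_{y2}\tilde{D}_{k2}U_{y2}^T U\right\|_F + \left\|U_{y2}^T U_{x1}\Sigma_{x1}V_{x1}^T V_{y2}\tilde{D}_{k2}U_{y2}^T U\right\|_F + \sqrt{d}\left\|U_{y2}^T U\right\| \\
&\overset{(i)}{\le} (\sqrt{\gamma} + \gamma)\,c\sqrt{\psi}\,\|\Sigma_{x1}\|\left\|\tilde{D}_{k2}\right\|_F + c\sqrt{d\gamma} \\
&\overset{(ii)}{\le} c\sqrt{\psi\gamma\sum_{i=d+1}^{N}\frac{\sigma_{x,\max}^2}{\sigma_{y,i}^2}\left(1 - (1 - \eta\sigma_{y,i}^2)^k\right)^2} + c\sqrt{d\gamma}.
\end{aligned}$$

Here, inequality (i) follows from $\|U_{y1}^T U_{x1}\| \le 1$ and the bounds in Section H.1, equations (34), (35) and (36), and holds in the regime $(d + n\sigma_z^2)\log(n) \le N$, $N\log(N) \le n$ and $9d \le N$, for some constant $c$ and with probability at least $1 - 2e^{-N/18} - 3n^{-10} - 3e^{-d} - e^{-n} - e^{-N}$. Inequality (ii) holds in the regime $(d + n\sigma_z^2)\log(n) \le N$, which implies $\gamma < 1$, and we used the definition of $\tilde{D}_{k2}$ from equation (22).

G.2 PROOF OF EQUATION (10)

Recall that $D_k = \mathrm{diag}\left(\ldots, (1 - (1 - \eta\sigma_{y,i}^2)^k)/\sigma_{y,i}, \ldots\right) \in \mathbb{R}^{N \times N}$. The second term in equation (6) can be bounded as

$$\frac{\sigma_z^2}{d}\left\|XV_y D_k U_y^T\right\|_F^2 = \frac{\sigma_z^2}{d}\left\|U_x\Sigma_x V_x^T V_y D_k U_y^T\right\|_F^2 \le \frac{\sigma_z^2}{d}\sigma_{x,\max}^2\left\|V_x^T V_y\right\|^2\|D_k\|_F^2 \le \frac{\sigma_z^2}{d}\sum_{i=1}^{N}\frac{\sigma_{x,\max}^2}{\sigma_{y,i}^2}\left(1 - (1 - \eta\sigma_{y,i}^2)^k\right)^2,$$

where we used that $\|V_x^T V_y\|^2 \le 1$ and the definition of $D_k$ from equation (22).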

H AUXILIARY PROOFS

In this Section we provide a summary of auxiliary proofs that are used to prove the main results in Sections F and G.

H.1 APPLYING THE SIN-THETA THEOREM TO BOUND THE DISTANCE BETWEEN SUBSPACES

In this Section, we use the following variant (Cai et al., 2015, Prop. 1) of the sin-theta theorem (Davis & Kahan, 1970) to bound the distances between subspaces occurring in the proofs in Sections F and G.

Proposition 1. Let $Q$ and $\hat{Q}$ be $n \times n$ symmetric matrices. Let $r < n$ be arbitrary and let $U$ and $\hat{U}$ be formed by the $r$ leading singular vectors of $Q$ and $\hat{Q}$. Then

$$
\left\| \hat{U}\hat{U}^T - UU^T \right\| \le \frac{2}{\sigma_r(Q) - \sigma_{r+1}(Q)} \left\| \hat{Q} - Q \right\|.
$$

Recall that $Y = U_y \Sigma_y V_y^T \in \mathbb{R}^{n \times N}$ and $X = U_x \Sigma_x V_x^T \in \mathbb{R}^{n \times N}$ are the SVDs of $Y$ and $X$ respectively. Let $U_{y1}, U_{x1} \in \mathbb{R}^{n \times d}$ be the $d$ leading left singular vectors of $Y$ and $X$, and let $U_{y2}, U_{x2} \in \mathbb{R}^{n \times (n-d)}$ be the orthonormal complements. Let $\tilde{U}_y = [U_{y1}\ U_{y2}] \in \mathbb{R}^{n \times n}$ and $\tilde{U}_x = [U_{x1}\ U_{x2}] \in \mathbb{R}^{n \times n}$. Analogous definitions can be made for the leading right singular vectors of $Y$ and $X$.

We start by applying Proposition 1 to bound the distance between the subspace spanned by the $d$ leading left singular vectors $U_{y1}$ of $Y$ and the subspace model $U$ as

$$
\begin{aligned}
\left\| U^T U_{y2} \right\| = \left\| U_{y1} U_{y1}^T - UU^T \right\|
&\overset{(i)}{\le} \left\| \frac{1}{N} YY^T - UU^T - \sigma_z^2 I \right\| \\
&\overset{(ii)}{\le} c \sqrt{\frac{(d + n\sigma_z^2)\log(n)}{N}} + c\, \frac{(d + n\sigma_z^2)\log(n)}{N} \\
&\overset{(iii)}{\le} c \sqrt{\frac{(d + n\sigma_z^2)\log(n)}{N}}, \qquad (34)
\end{aligned}
$$

where inequality (i) follows from Proposition 1 and inequality (ii) holds with probability at least $1 - n^{-10} - 2e^{-d} - e^{-n}$ and follows from Section H.2, equation (45). Inequality (iii) holds in the regime $(d + n\sigma_z^2)\log(n) \le N$.

Next, we establish a bound for the distance between the subspaces spanned by the $d$ leading left singular vectors of $X$ and $Y$. We have that

$$
\begin{aligned}
\left\| U_{y2}^T U_{x1} \right\| = \left\| U_{y1} U_{y1}^T - U_{x1} U_{x1}^T \right\|
&\le \left\| U_{y1} U_{y1}^T - UU^T \right\| + \left\| U_{x1} U_{x1}^T - UU^T \right\| \\
&\overset{(i)}{\le} \left\| \frac{1}{N} YY^T - UU^T - \sigma_z^2 I \right\| + \left\| \frac{1}{N} XX^T - UU^T \right\| \\
&\overset{(ii)}{\le} c \sqrt{\frac{(d + n\sigma_z^2)\log(n)}{N}} + c\, \frac{(d + n\sigma_z^2)\log(n)}{N} + c \sqrt{\frac{d\log(n)}{N}} + c\, \frac{d\log(n)}{N} \\
&\le c \sqrt{\frac{(d + n\sigma_z^2)\log(n)}{N}} + c\, \frac{(d + n\sigma_z^2)\log(n)}{N} \\
&\overset{(iii)}{\le} c \sqrt{\frac{(d + n\sigma_z^2)\log(n)}{N}}, \qquad (35)
\end{aligned}
$$

where inequality (i) follows from Proposition 1 and inequality (ii) holds for some constant $c$ and with probability at least $1 - 2n^{-10} - 2e^{-d} - e^{-n}$ and follows from the results in Section H.2, equations (45), (46). Inequality (iii) holds in the regime $(d + n\sigma_z^2)\log(n) \le N$.

Last, we establish a bound for the distance between the subspaces spanned by the right singular vectors of $X$ and $Y$. We have that with probability at least $1 - 2e^{-N/18} - 2n^{-10} - e^{-n} - e^{-N} - e^{-d}$

$$
\begin{aligned}
\left\| V_{x1}^T V_{y2} \right\| = \left\| V_{x1} V_{x1}^T - V_{y1} V_{y1}^T \right\|
&\overset{(i)}{\le} \frac{2}{\sigma_{x,d}^2 - \sigma_{x,d+1}^2} \left\| Y^T Y - X^T X \right\| \\
&\overset{(ii)}{\le} \frac{c}{N} \left\| Y^T Y - X^T X \right\| \\
&\le c \left\| \frac{1}{N} Z^T Z \right\| + 2c \left\| \frac{1}{N} C^T U^T Z \right\| \\
&\le c \left\| \frac{1}{N} ZZ^T - \sigma_z^2 I \right\| + c \left\| \sigma_z^2 I \right\| + 2c \left\| \frac{1}{N} C^T U^T Z \right\| \\
&\overset{(iii)}{\le} c \sqrt{\frac{n\sigma_z^4 \log(n)}{N}} + c\, \frac{n\sigma_z^2 \log(n)}{N} + c\,\sigma_z \log(N) \\
&= c\,\sigma_z^2 \sqrt{\frac{n\log(n)}{N}} + c\,\sigma_z^2\, \frac{n\log(n)}{N} + c \sqrt{\frac{\sigma_z^2 N \log(N)^2}{N}} \\
&\overset{(iv)}{\le} c\,\sigma_z^2\, \frac{n\log(n)}{N} + c \sqrt{\frac{\sigma_z^2\, n\log(n)}{N}} \\
&\overset{(v)}{\le} c \sqrt{\frac{n\sigma_z^2 \log(n)}{N}}. \qquad (36)
\end{aligned}
$$

Inequality (i) follows from Proposition 1. Inequality (ii) follows from the fact that $\sigma_{x,d+1}^2 = 0$ and that $\frac{1}{c} N \le \sigma_{x,d}^2$ holds in the regime $9d \le N$, with probability at least $1 - 2e^{-N/18}$ and for some constant $c$ (see Section H.4, equation (63)). Inequality (iii) holds for some constant $c$ and with probability at least $1 - 2n^{-10} - e^{-n} - e^{-N} - e^{-d}$ and follows from the results in Section H.2, equations (47), (38). Inequality (iv) holds in the regime $N\log(N) \le n$. Inequality (v) holds in the regime $n\sigma_z^2\log(n) \le N$. To abbreviate notation, from now on we define $\psi := \frac{n\sigma_z^2\log(n)}{N}$.
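As a quick numerical sanity check of Proposition 1, the following sketch builds a rank-$r$ projection matrix $Q = UU^T$, perturbs it by a small symmetric matrix, and compares both sides of the bound. The dimensions, the random seed and the perturbation size 0.05 are arbitrary assumptions made for illustration; the check does not replace the proof.

```python
import numpy as np

rng = np.random.default_rng(0)
n, r = 50, 5  # hypothetical dimensions

# Q = U U^T is symmetric with sigma_r(Q) = 1 and sigma_{r+1}(Q) = 0.
U, _ = np.linalg.qr(rng.standard_normal((n, r)))
Q = U @ U.T

# Symmetric perturbation, rescaled to spectral norm 0.05.
E = rng.standard_normal((n, n))
E = E + E.T
E = 0.05 * E / np.linalg.norm(E, 2)
Q_hat = Q + E

# r leading singular vectors of the perturbed matrix.
U_hat = np.linalg.svd(Q_hat)[0][:, :r]

singular_values = np.linalg.svd(Q, compute_uv=False)
gap = singular_values[r - 1] - singular_values[r]

lhs = np.linalg.norm(U_hat @ U_hat.T - U @ U.T, 2)
rhs = 2.0 / gap * np.linalg.norm(Q_hat - Q, 2)
print(f"distance between projectors: {lhs:.4f}, bound: {rhs:.4f}")
```

In the proofs above the same statement is applied with $\frac{1}{N}YY^T$ (or $\frac{1}{N}XX^T$) in the role of the perturbed matrix, so the subspace distance is controlled by how well the sample covariance concentrates.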

H.2 BOUNDING A SUM OF INDEPENDENT RANDOM MATRICES

The Matrix Bernstein inequality (Oliveira, 2010; Tropp, 2012) can be used to bound a sum of independent, bounded and centered random matrices. We state the theorem below and then show how we applied it to bound several terms occurring in Sections F and G.

Theorem 3 (Matrix Bernstein). Let $S_1, \ldots, S_n$ be independent, centered random matrices with common dimension $d \times d$, and assume that each one is uniformly bounded. Then, for $\delta \ge 0$, with probability at least $1 - e^{-\delta}$,

$$
\|Z\| \le \frac{2}{3} L \left(\delta + \log(2d)\right) + \sqrt{2 v(Z) \left(\delta + \log(2d)\right)}.
$$

Recall our signal model to be $Y = X + Z = UC + Z$, with entries i.i.d. as $c_{i,j} \sim \mathcal{N}(0, 1)$ and $z_{i,j} \sim \mathcal{N}(0, \sigma_z^2)$ and dimensions $Y, X, Z \in \mathbb{R}^{n \times N}$, $C \in \mathbb{R}^{d \times N}$. The subspace matrix $U \in \mathbb{R}^{n \times d}$ has orthonormal columns.

We start by applying Theorem 3 to establish the following bound. With probability at least $1 - N^{-10} - e^{-N} - e^{-d}$ and for some unspecified numerical constant $c$ we have

$$
\left\| \frac{1}{N} C^T U^T Z \right\| \le c\,\sigma_z \log(N). \qquad (38)
$$

Proof of equation (38): We define $\tilde{Z} = U^T Z \in \mathbb{R}^{d \times N}$. Note that the entries in $\tilde{Z}$ are independent and identically distributed like the entries in $Z$, i.e., $\tilde{z}_{i,j} \sim \mathcal{N}(0, \sigma_z^2)$. Next we check the conditions for applying Theorem 3 to bound $\|C^T \tilde{Z}\|$. Note that $C^T \tilde{Z} = \sum_{i=1}^{d} c_i \tilde{z}_i^T$, with $c_i, \tilde{z}_i \in \mathbb{R}^N$ being the rows of $C$, $\tilde{Z}$. Since $c_i$ has zero mean, $\mathbb{E}\left[c_i \tilde{z}_i^T\right] = 0$ for all $i = 1, \ldots, d$. Further, we have

$$
\left\| c_i \tilde{z}_i^T \right\| = \max_{\|w\|_2 = 1} \left\| c_i \tilde{z}_i^T w \right\|_2 = \max_{\|w\|_2 = 1} |\tilde{z}_i^T w|\, \|c_i\|_2 = \frac{\|\tilde{z}_i\|_2^2}{\|\tilde{z}_i\|_2} \|c_i\|_2 = \|\tilde{z}_i\|_2 \|c_i\|_2 \le c\,\sigma_z N,
$$

where the inequality follows from equations (57), (56) and holds with probability at least $1 - e^{-N} - e^{-d}$. Finally, we need to compute the matrix variance statistic $v(C^T \tilde{Z})$. Note that

$$
\mathbb{E}\left[ c_i \tilde{z}_i^T (c_i \tilde{z}_i^T)^T \right] = \mathbb{E}\left[ \tilde{z}_i^T \tilde{z}_i\, c_i c_i^T \right] = \sigma_z^2 N I = \mathbb{E}\left[ (c_i \tilde{z}_i^T)^T c_i \tilde{z}_i^T \right],
$$

and therefore

$$
v(C^T \tilde{Z}) = \max\left\{ \left\| \sum_{i=1}^{d} \mathbb{E}\left[ c_i \tilde{z}_i^T (c_i \tilde{z}_i^T)^T \right] \right\|, \left\| \sum_{i=1}^{d} \mathbb{E}\left[ (c_i \tilde{z}_i^T)^T c_i \tilde{z}_i^T \right] \right\| \right\} = \sigma_z^2\, d N. \qquad (42)
$$

With equations (40), (42) in place we are ready to apply Theorem 3 to obtain, with probability at least $1 - N^{-10} - e^{-N} - e^{-d}$ and some constant $c$,

$$
\left\| \frac{1}{N} C^T U^T Z \right\| \le c \sqrt{\frac{\sigma_z^2\, d \log(N)}{N}} + c\,\sigma_z \log(N) \le c\,\sigma_z \log(N), \qquad (43)
$$

where the last inequality holds since $d < N$. This concludes the proof of equation (38).

In the remainder of this Section we apply the example in Tropp (2015, Sec. 1.6.3) that illustrates how to use Theorem 3 to bound the distance between sample and true covariance matrices. With probability at least $1 - n^{-10}$ and an unspecified numerical constant $c$ we have

$$
\left\| \frac{1}{N} \sum_{i=1}^{N} a_i a_i^T - \mathbb{E}\left[aa^T\right] \right\| \le c \sqrt{\frac{B \left\| \mathbb{E}\left[aa^T\right] \right\| \log(n)}{N}} + c\, \frac{B \log(n)}{N}, \qquad (44)
$$



Figure 1: Empirical scaling laws for CNN-based image reconstruction. Reconstruction performance of a U-Net for denoising (a) and accelerated MRI (c) as a function of the training set size $N$. In both experiments an initial steep power law $R = \beta N^{\alpha}$ transitions to a relatively flat one already at moderate $N$. Thus we expect that training on millions of images does not significantly improve performance. To obtain the scaling curves in (a), (c) we optimize over the number of network parameters as shown in (b), (d). Colors in the plots on the left and right correspond to the same training set size. Since for large training set sizes, corresponding to the flattened scaling law, increasing the number of parameters does not boost performance further, the decay in scaling coefficients is a robust finding.
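A power law $R = \beta N^{\alpha}$ is a straight line in log-log coordinates, so its coefficients can be estimated with an ordinary least-squares line fit on $(\log N, \log R)$. The sketch below illustrates this on synthetic data; the function name `fit_power_law`, the chosen training set sizes and the coefficients are hypothetical and are not the values measured in the paper.

```python
import math

def fit_power_law(sizes, risks):
    """Least-squares fit of log R = log(beta) + alpha * log N; returns (alpha, beta)."""
    xs = [math.log(n) for n in sizes]
    ys = [math.log(r) for r in risks]
    x_mean = sum(xs) / len(xs)
    y_mean = sum(ys) / len(ys)
    sxx = sum((x - x_mean) ** 2 for x in xs)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    alpha = sxy / sxx
    beta = math.exp(y_mean - alpha * x_mean)
    return alpha, beta

# Synthetic (hypothetical) measurements: a steep regime followed by a flat regime.
sizes_small = [100, 300, 1000, 3000]
risks_small = [2.0 * n ** -0.5 for n in sizes_small]
sizes_large = [10_000, 30_000, 100_000]
risks_large = [0.2 * n ** -0.05 for n in sizes_large]

alpha_small, beta_small = fit_power_law(sizes_small, risks_small)
alpha_large, beta_large = fit_power_law(sizes_large, risks_large)
print(f"small N: alpha = {alpha_small:.3f}, large N: alpha = {alpha_large:.3f}")
```

Fitting the two regimes separately, as done here, is one simple way to quantify how much the scaling coefficient decays between small and moderate training set sizes.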

Zhang et al. (2017) report that for DnCNN, a standard CNN, using more than 400 distinct images with data augmentation yields only negligible improvements. Chen et al. (2021) pre-train an image processing transformer (IPT) with 115.5M network parameters on ImageNet (1.1M distinct images), and report the performance after fine-tuning to a specific task as a function of the size of the pretraining dataset. IPT's performance for denoising is surpassed by the latest SOTA in the form of the CNN-based DRUnet (Chen et al., 2021), the transformer-based SwinIR (Liang et al., 2021) and the Swin-Conv-Unet (Zhang et al., 2022), a combination of the two. Those models have significantly fewer network parameters and were trained on a training set consisting of only ∼10k images, leaving the role of training set size in image denoising open.

Net and SwinIR we vary the number of network parameters as depicted in Fig. 1 (b) and Fig. 2 (b) respectively. Appx. A.1 and A.3 contain detailed descriptions on how the models are trained and the network size is adapted. Results and discussion. Fig. 1(a) and Fig. 2(a) show the reconstruction performances of the best U-Net and SwinIR respectively over all considered network sizes. Our main findings are as follows.

Now consider the estimator $f(y) = W_{\mathrm{PCA}}\, y$ with $W_{\mathrm{PCA}} = \frac{1}{1+\sigma_z^2} \hat{U}\hat{U}^T$, based on $N$ noisy training points. The estimator assumes knowledge of the noise variance, but the noise variance can be estimated relatively accurately without much difficulty. The following result, proven in Appx. F, characterizes the associated risk. Theorem 1. Suppose that the number of training examples obeys

Figure 3: Subspace denoising. Simulated risks of the early stopped ERM estimator $W_{k_{\mathrm{opt}}}$ (left) and the PCA estimator $W_{\mathrm{PCA}}$ (right) over the training set size $N$. The signal and ambient dimensions are $d = 10$, $n = 1000$ and we vary the noise level $\sigma_z$. We fit linear scaling laws $\sim N^{\alpha}$ to the power law regions. As the noise level decreases, the scaling coefficients $\alpha$ increase. Further, the learned estimator exhibits steeper scaling than the PCA estimator. Error bars are over 5 independent runs.
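The PCA estimator in this kind of simulation can be sketched in a few lines of NumPy. The sketch below draws data from the subspace model $y = Uc + z$, estimates the subspace from $N$ noisy samples, and evaluates the risk in closed form. Several choices are assumptions made for illustration: the risk is normalized as $\frac{1}{d}\mathbb{E}\|Wy - x\|_2^2$, the ambient dimension is reduced to $n = 200$ to keep the run fast, and the seed, noise level and training set sizes are arbitrary.

```python
import numpy as np

def pca_risk(n, d, sigma_z, N, rng):
    """Risk (1/d) E||W y - x||^2 of W = U_hat U_hat^T / (1 + sigma_z^2), evaluated in closed form."""
    U, _ = np.linalg.qr(rng.standard_normal((n, d)))   # true subspace, orthonormal columns
    C = rng.standard_normal((d, N))                    # coefficients c ~ N(0, I)
    Z = sigma_z * rng.standard_normal((n, N))          # noise z ~ N(0, sigma_z^2 I)
    Y = U @ C + Z                                      # noisy training points as columns
    U_hat = np.linalg.svd(Y, full_matrices=False)[0][:, :d]  # d leading left singular vectors
    W = (U_hat @ U_hat.T) / (1.0 + sigma_z ** 2)
    # For x = U c and y = x + z: E||W y - x||^2 = ||(W - I) U||_F^2 + sigma_z^2 ||W||_F^2.
    signal_err = np.linalg.norm((W - np.eye(n)) @ U, "fro") ** 2
    noise_err = sigma_z ** 2 * np.linalg.norm(W, "fro") ** 2
    return (signal_err + noise_err) / d

rng = np.random.default_rng(0)
n, d, sigma_z = 200, 10, 0.2            # hypothetical setup, smaller than in the figure
train_sizes = [50, 200, 1000, 5000]
risks = [pca_risk(n, d, sigma_z, N, rng) for N in train_sizes]
optimal_risk = sigma_z ** 2 / (1.0 + sigma_z ** 2)   # risk of U U^T / (1 + sigma_z^2)

for N, r in zip(train_sizes, risks):
    print(f"N = {N:5d}: risk = {r:.4f} (optimal {optimal_risk:.4f})")
```

Under these assumptions the risk decreases with $N$ and levels off near the optimal risk $\sigma_z^2/(1+\sigma_z^2)$, which is the flattening behavior the figure illustrates.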

(a). Despite the best SwinIR clearly outperforming the best U-Net in terms of PSNR

Published as a conference paper at ICLR

Figure 4: Reconstructions along the scaling law for denoising with a U-Net and SwinIR. The two examples illustrate how the reconstruction quality improves as the training set size N increases. First rows: ground truth and reconstruction, Second rows: residuals w.r.t. the ground truth.

we train the SwinIR for color denoising from Liang et al. (2021) on the same training sets mentioned in Appx. A.1 with {0.1, 0.3, 1, 10, 100} thousand images from ImageNet. Instead of center cropping the training images to 256 × 256, we have to crop to 128 × 128 pixels, as larger input patches would make it computationally infeasible for us to train large versions of the SwinIR. The largest SwinIR alone took over 2 months to train on 4 NVIDIA A40 GPUs.

We use the same training, validation and test sets as described in Appx. A.1 with training set sizes N ∈ [100, 100k] images from ImageNet. On top we add two larger training sets of size 300k and 600k. Instead of center cropping the training images to 256 × 256 we follow the super-resolution experiments in Liang et al. (2021) and train on images cropped to 128 × 128 pixels. We consider super-resolution of factor 2 so the low-resolution images have size 64 × 64. The low-resolution images are obtained with the bicubic downsampling function of Python's PIL.Image package.
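The low-resolution inputs described above can be produced with Pillow along the following lines. This is a minimal sketch: the text only states that the bicubic downsampling function of `PIL.Image` is used, so the use of `Image.resize` with `Image.BICUBIC`, the helper name `make_low_res` and the synthetic stand-in image are assumptions.

```python
from PIL import Image

def make_low_res(hr_image, factor=2):
    """Downsample a high-resolution PIL image by `factor` with bicubic interpolation."""
    width, height = hr_image.size
    return hr_image.resize((width // factor, height // factor), resample=Image.BICUBIC)

# Stand-in for a 128 x 128 training crop (a real pipeline would load an ImageNet crop).
hr = Image.new("RGB", (128, 128), color=(120, 60, 200))
lr = make_low_res(hr, factor=2)
print(hr.size, "->", lr.size)
```

With a factor of 2, a 128 × 128 crop yields a 64 × 64 low-resolution image, matching the sizes stated in the text.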

Results and discussion. For each training set size, Fig. 6(a) shows the reconstruction performance in PSNR of the best model over all simulated network sizes in Fig. 6(b). Since the curves per training set size in Fig. 6(b) are relatively flat, further scaling up the network size is not expected to significantly improve the performance on the studied training sets. Here are the two main findings:

Figure 6: Empirical scaling laws for super-resolution with a U-Net. The curve in (a) contains the best reconstruction performances per training set size over the different network sizes depicted in (b). Colors in the plot on the left and right correspond to the same training set size. While the scaling laws for super-resolution do not slow as early as those for denoising and compressive sensing (see Fig. 1), they are likely to slow further at a larger number of training examples.

Figure 7: Empirical scaling laws for denoising with fixed noise. The colored curve in (a) contains the best reconstruction performances per training set size over the different network sizes depicted in (b) for fixed noise realizations during training. Colors in the plot on the left and right correspond to the same training set size. The gray curve is taken from Fig. 1(a) and shows the performance when the noise is re-sampled during training. The initial drop in performance due to fixing the noise during training shrinks as the training set size increases, resulting in slightly steeper scaling compared to re-sampling the noise. Yet, we expect the scaling of the performance of training with fixed noise to flatten as well, as it approaches the performance of re-sampling the noise.

Figure 8: Comparison of empirical scaling laws for denoising with noise levels 15 and 25. The colored curves in (a) and (c) contain the best reconstruction performances per training set size over the different network sizes depicted in (b) and (d) for noise levels 15 and 25 respectively. Colors in the plot on the left and right correspond to the same training set size. The curves in (c), (d) are taken from Fig. 1(a), (b).

Figure 9: Empirical scaling laws for denoising with a smaller patch size. The colored curve in (a) contains the best reconstruction performances per training set size over the different network sizes depicted in (b) for training patches of size 128 × 128. Colors in the plot on the left and right correspond to the same training set size. The gray curve is taken from Fig. 1(a) and shows the performance for training patches of size 256 × 256. Note that the x-axis shows the number of training patches of size 128 × 128. Hence, one patch of size 256 × 256 is worth 4 patches of size 128 × 128. We see that in the regime of large training set sizes, fewer but larger patches perform better than more but smaller patches.

Figure 10: Effect of early stopping the learned estimator. The risk of the early stopped learned estimator $W_{k_{\mathrm{opt}}}$ and the converged estimator $W_{\infty}$ as a function of the training set size $N$, measured in simulations. While both estimators approach the optimal performance $R(W^*)$ for large $N$, early stopping is critical for performance in the regime $N \approx n$. We consider the setup $d = 10$, $n = 100$, $\sigma_z = 0.05$. Error bars are over 5 independent runs.
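The effect described in this caption can be reproduced qualitatively with a small NumPy sketch at $N = n$. The exact form of the empirical risk is not shown here, so the sketch assumes gradient descent on $\frac{1}{2}\|WY - X\|_F^2$ started at $W_0 = 0$, for which the iterates have the closed form $W_k = X V_y D_k U_y^T$ with $D_k = \mathrm{diag}((1 - (1 - \eta\sigma_{y,i}^2)^k)/\sigma_{y,i})$. The step size, the grid of stopping times, the seed and the risk normalization $\frac{1}{d}\mathbb{E}\|Wy - x\|_2^2$ are further assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, sigma_z, N = 100, 10, 0.05, 100   # setup from the caption, with N = n

U, _ = np.linalg.qr(rng.standard_normal((n, d)))   # true subspace
X = U @ rng.standard_normal((d, N))                # clean training signals as columns
Y = X + sigma_z * rng.standard_normal((n, N))      # noisy training measurements

Uy, s, Vyt = np.linalg.svd(Y, full_matrices=False)
eta = 1.0 / s[0] ** 2                              # assumed step size

def risk(W):
    """(1/d) E||W y - x||^2 for x = U c, y = x + z, evaluated in closed form."""
    signal_err = np.linalg.norm((W - np.eye(n)) @ U, "fro") ** 2
    noise_err = sigma_z ** 2 * np.linalg.norm(W, "fro") ** 2
    return (signal_err + noise_err) / d

def early_stopped_estimator(k):
    """Closed form of k gradient descent steps from W_0 = 0: X V_y D_k U_y^T."""
    filt = (1.0 - (1.0 - eta * s ** 2) ** k) / s
    return X @ Vyt.T @ np.diag(filt) @ Uy.T

stopping_times = [2 ** j for j in range(25)]
risks = [risk(early_stopped_estimator(k)) for k in stopping_times]
risk_early = min(risks)                            # best stopping time on the grid
risk_converged = risk(X @ np.linalg.pinv(Y))       # W_infinity = X Y^dagger
optimal_risk = sigma_z ** 2 / (1.0 + sigma_z ** 2)

print(f"early stopped: {risk_early:.4f}, converged: {risk_converged:.4f}, optimal: {optimal_risk:.4f}")
```

Picking the stopping time by minimizing the true risk, as done here, is only possible in simulation; it serves as a stand-in for the optimal early-stopping time $k_{\mathrm{opt}}$.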

Figure 11: Additional numerical results for subspace denoising. Left and right show the simulated risk of the early stopped empirical risk minimizer $W_{k_{\mathrm{opt}}}$ and the PCA estimator $W_{\mathrm{PCA}}$ as a function of the training set size $N$. In the upper part we fix the signal dimension and noise level $d = 10$, $\sigma_z = 0.1$ and vary the ambient dimension $n$. In the lower part we fix the ambient dimension and noise level $n = 1000$, $\sigma_z = 0.1$ and vary the signal dimension $d$. We fit linear scaling laws $\sim N^{\alpha}$ to the power law regions. Similar to Fig. 3, we observe that the scaling coefficients $\alpha$ change with varying model parameters. Further, the learned estimator exhibits steeper scaling than the PCA estimator over all settings. Error bars are over 5 independent runs.

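$\mathbb{E}[S_k] = 0$ and $\|S_k\| \le L$ for each $k = 1, \ldots, n$, and let $v(Z)$ denote the matrix variance statistic of the sum: $v(Z) = \max\left\{ \left\| \mathbb{E}\left[ZZ^T\right] \right\|, \left\| \mathbb{E}\left[Z^T Z\right] \right\| \right\}$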

The role of training set size in inverse problems. For image reconstruction and inverse problems in general, we are not aware of work studying scaling laws in a principled manner, covering different

Empirical scaling laws for transformer-based image denoising. The colored curve in (a) shows the best PSNR per training set size of a SwinIR (Liang et al., 2021) with varying network sizes as depicted in (b). Colors in the plot on the left and right correspond to the same training set size. The SwinIR outperforms the U-Net (gray curve from Fig. 1(a)), but as for the U-Net, the rate of improvement slows to a level that indicates that further increasing the dataset size would only marginally improve performance.

Batch size and maximal number of steps for every experiment in Fig. 2. Each experiment can be identified by the number of training examples $N$ in thousands and the network size small (S) / middle (M) / large (L) / huge (H).

List of used network configurations of the SwinIR (Liang et al., 2021). We consider four network sizes small (S) / middle (M) / large (L) / huge (H) by varying the number of residual Swin Transformer blocks, the number of Swin Transformers per block, the number of attention heads per Swin Transformer, the number of channels in the input embedding and the width of the fully connected layers in a Swin Transformer.

Benchmarking results for Gaussian color image denoising. Average PSNR on 4 common test sets of our best U-Net (46.5M parameters, 100k training images) and three different versions of the SwinIR (see Appx. A.3). Values for the original SwinIR and SCUnet are taken from Liang et al. (2021) and Zhang et al. (2022). Best and second best performance are in red and blue colors respectively.

In this Section, we evaluate the models we trained for image denoising in Section 3 on four common test sets from the literature and show that the largest SwinIR trained on the largest dataset achieves new SOTA for all four test sets and the considered noise level.

ACKNOWLEDGMENTS

The authors acknowledge support by the Institute of Advanced Studies at the Technical University of Munich, the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) -456465471, 464123524, the DAAD, and the German Federal Ministry of Education and Research and the Bavarian State Ministry for Science and the Arts. The authors of this work take full responsibility for its content.

REPRODUCIBILITY

The repository at https://github.com/MLI-lab/Scaling_Laws_For_Deep_Learning_Based_Image_Reconstruction contains the code to reproduce all results in the main body of this paper.

Early-stopped empirical risk minimization. We consider the estimator that applies gradient descent to the empirical risk

where $X, Y \in \mathbb{R}^{n \times N}$ contain the training examples as columns, and early-stops after $k$ iterations for regularization.

We next discuss the early-stopped estimator $W_k$ in more detail. Fig. 10 numerically demonstrates the regularizing effect of early stopping gradient descent, where $W_{\infty} = XY^{\dagger}$ is the converged learned estimator (see Appx. G, Eq. (5)). We see that regularization is necessary for this estimator to perform well.

We next discuss Theorem 2 and the associated assumptions in more detail. The theorem considers the following regime: (i) the number of training examples obeys $(d + n\sigma_z^2)\log(n) \le N$, (ii) $N \le \xi d/\sigma_z^2$ for an arbitrary $\xi$, and (iii) $N\log(N) \le n$. For this regime, the theorem guarantees that the risk of the optimally early-stopped estimator obeys, with high probability,

(3)

The theorem looks similar to that for the PCA estimate (Theorem 1), in that the risk is a constant away from the optimal risk, with an error term that becomes small as $(d + n\sigma_z^2)/N$ becomes small. However, the error bound does not converge to $R(W^*)$ as the number of training examples, $N$, converges to infinity. This is probably an artifact of our analysis, but it is unclear, at least to us, how to derive a substantially tighter bound. In our analysis (see Appendix G), we balance two errors: one error decreases in $k$ and is associated with the part of the signal projected into the subspace, and the second error increases with $k$ and is associated with the orthogonal complement of the subspace. We choose the early-stopping time to optimally balance those two terms, which yields the stated bound (3).

Now, regarding the assumptions: Assumption (iii), $N\log(N) \le n$, means we are in the high-dimensional regime; we think this is somewhat closer to reality (for example, for denoising a 512×512 image, this would require the number of training examples to be smaller than 250k), but we can derive an analogous bound for the regime $N\log(N) \ge n$, where the number of training examples is larger than the ambient dimension. Assumption (i), $(d + n\sigma_z^2)\log(n) \le N$, is relatively mild, as it is necessary in order to estimate the subspace somewhat accurately; this assumption is also required for the PCA estimate. Assumption (ii), $N \le \xi d/\sigma_z^2$ for an arbitrary $\xi$, is not restrictive in that $\xi$ can be arbitrarily large; we make this assumption only so that the theorem can be stated in a convenient way. However, assumption (ii) reveals a shortcoming of Theorem 2, which is that we cannot make the bound go to zero as $N \to \infty$, since increasing $\xi$ increases one term in the bound and decreases another one.

E.1 ADDITIONAL NUMERICAL SIMULATIONS

In this Section we provide further numerical simulations for the PCA subspace estimator and the estimator learned with early stopped gradient descent discussed in Section 5, Theorem 1 and Theorem 2. Similar to Fig. 3, Fig. 11 shows the risks $R(W_{k_{\mathrm{opt}}})$ and $R(W_{\mathrm{PCA}})$ as a function of the number of training examples $N$ for varying values of the signal and ambient dimensions $d$ and $n$, while fixing all other model parameters. In the power law region we fit linear power laws with negative scaling coefficients $\alpha$. We observe steeper power laws (larger $|\alpha|$) for smaller ambient dimensions $n$ and larger signal dimensions $d$. Also, the learned estimator consistently exhibits steeper scaling (larger $|\alpha|$) than the PCA estimator.

F PROOF FOR THEOREM 1: RISK BOUND FOR PCA SUBSPACE ESTIMATION

We provide a bound on the risk of the estimator $f(y) = W_{\mathrm{PCA}}\, y$ with $W_{\mathrm{PCA}} = \tau \hat{U}\hat{U}^T$ and $\tau = \frac{1}{1+\sigma_z^2}$. Recall that $\hat{U} \in \mathbb{R}^{n \times d}$ contains the singular vectors corresponding to the d-leading

where we assume that the $\ell_2$ norm of the random vector $a \in \mathbb{R}^n$ is bounded, $\|a\|_2^2 \le B$.

In the following we show how to apply (44) to establish the following bounds. With probability at least $1 - n^{-10} - 2e^{-d} - e^{-n}$ and for some unspecified numerical constant $c$

With probability at least $1 - n^{-10} - e^{-d}$ and for some unspecified numerical constant $c$

With probability at least $1 - n^{-10} - e^{-n}$ and for some unspecified numerical constant $c$

In the notation of the general example in (44), the proofs of (45)-(47) consist of deriving expressions for $\|a\|_2^2 \le B$ and $\mathbb{E}\left[aa^T\right]$ respectively.

Proof of equation (45): We have

for some constant $c$ and where the first inequality holds with probability at least $1 - 2e^{-d} - e^{-n}$, as shown in Section H.3. Further, we have

Since practical noise levels satisfy $\sigma_z^2 \le 1$, there exists some constant $c$ such that $\|y\|$

with probability at least $1 - 2e^{-d} - e^{-n}$. Inserting (48) and (50) in the general form in (44) concludes the proof of equation (45).

Proof of equation (46): We have

for some constant $c$ and where the inequality holds with probability at least $1 - e^{-d}$, as shown in Section H.3. Further, we have

(52)

Inserting (51) and (52) in the general form in (44) concludes the proof of equation (46).

Proof of equation (47): We have

for some constant $c$ and where the inequality holds with probability at least $1 - e^{-n}$, as shown in Section H.3. Further, we have

(54)

Inserting (53) and (54) in the general form in (44) concludes the proof of equation (47).

H.3 TAIL BOUNDS FOR INNER PRODUCTS OF GAUSSIAN VECTORS

Recall that $c \sim \mathcal{N}(0, I) \in \mathbb{R}^d$ and $z \sim \mathcal{N}(0, \sigma_z^2 I) \in \mathbb{R}^n$ are the columns of $C$, $Z$ respectively. Also, $U \in \mathbb{R}^{n \times d}$ has orthonormal columns. In this section we state some relations on the concentration of the inner products between those vectors. The chi-squared distributed $c^T c$ can be bounded with probability at most $e^{-t}$ as

Substituting $t$ with the signal dimension $d$ gives

with probability at most $e^{-d}$. Using the same result we can write

with probability at most $e^{-n}$ and where $z' \sim \mathcal{N}(0, I)$. Next we show that with probability at most

To this end, note that for any (deterministic) vector $c$ we have $z^T U c \sim \mathcal{N}(0, \sigma_z^2 \|c\|_2^2)$, and we apply a simple tail bound for Gaussian random variables to get

with probability at most $e^{-t^2}$. It is straightforward to see that substituting $t$ with $\sqrt{d}$, applying (56) to bound $\|c\|_2$ and combining the results with a union bound results in the bound (58). We can combine the results from this section to bound the sum

with probability at most $2e^{-d} + e^{-n}$.

H.4 BOUNDING THE EXTREME SINGULAR VALUES OF GAUSSIAN RANDOM MATRICES AND EMPIRICAL COVARIANCE MATRICES

In this Section we state results for the extreme singular values of some of the matrices occurring in Sections F and G. Specifically, we establish equations (17), (18), (29) and (36) (ii).

A standard deviation inequality for the extreme singular values of some matrix $A \in \mathbb{R}^{M \times m}$ with independent and identically standard normal distributed entries implies that (Rudelson & Vershynin, 2010, equation (2.3))

with probability at least $1 - 2e^{-t^2/2}$ for $t \ge 0$.

We start with the largest singular value $\sigma_{x,\max}$ of the feature matrix $X \in \mathbb{R}^{n \times N}$. Recall that we have $X = UC$ with orthonormal $U \in \mathbb{R}^{n \times d}$ and $C \in \mathbb{R}^{d \times N}$ with i.i.d. entries $c_{i,j} \sim \mathcal{N}(0, 1)$. Hence, the singular values of $UC$ are the singular values of $C$.

which holds with probability at least $1 - 2e^{-N/8}$. This establishes equation (17). For $\sigma_{x,\min}$, the smallest non-zero singular value of $X$, we have $\sigma_{x,\min} = \sigma_{x,d} = \sigma_{c,d}$. In the regime $9d \le N$ and choosing $t = \sqrt{N}/3$ we obtain

which holds with probability at least $1 - 2e^{-N/18}$ and for some constant $c$. This establishes (36) (ii).

Next, we state a bound on the largest singular value $\|Z\|$ of the noise matrix $Z \in \mathbb{R}^{n \times N}$ with i.i.d. entries $z_{i,j} \sim \mathcal{N}(0, \sigma_z^2)$. Note that $Z = \sigma_z \tilde{Z}$, where the entries of $\tilde{Z}$ follow $\tilde{z}_{i,j} \sim \mathcal{N}(0, 1)$. Applying equation (61) with t =

with probability at least $1 - 2e^{-n/2}$ and some constant $c$. For the last inequality we assumed $n \ge N$. This establishes equation (29).

Finally, we derive bounds for some of the squared singular values of $Y \in \mathbb{R}^{n \times N}$. In particular, we show that

and

and finally

both hold with probability at least $1 - e^{-d} - e^{-n} - n^{-9}$ and for some constants $C$ and $\epsilon \in (0, 1)$ in the regime $N \ge 3C\epsilon^{-2}(d + \sigma_z^2 n)\log n$. This establishes equation (18). To this end, we rely on the following Corollary (Vershynin, 2011, Corollary 5.52).

Corollary 1 (Covariance estimation for arbitrary distributions). Let $x$ be a random vector in $\mathbb{R}^n$ supported in some centered Euclidean ball whose radius we denote $\sqrt{m}$. Consider $N$ independent samples $x_i$ arranged as columns of the random matrix $A \in \mathbb{R}^{n \times N}$. Denote $\Sigma_N = \frac{1}{N} AA^T$ as the sample covariance matrix and $\Sigma$ as the true covariance matrix. Let $\epsilon \in (0, 1)$ and $t \ge 1$. Then the following holds with probability at least $1 - n^{-t^2}$:

Here $C$ is an absolute constant.

We apply Corollary 1 to the random vectors $y = Uc + z$. Note that $\mathbb{E}\left[yy^T\right] = UU^T + \sigma_z^2 I$. Further note that $\mathbb{E}\,\|y\|$

where the second inequality follows from Section H.3, (56), (57) and holds with probability at least $1 - e^{-d} - e^{-n}$. Now we can apply Corollary 1 to make the following statement. For some constant $C$ and $\epsilon \in (0, 1)$, if $N \ge Ct\epsilon^{-2}(d + \sigma_z^2 n)\log n$, then with probability at least $1 - e^{-d} - e^{-n} - n^{-t^2}$

$$
\left\| \frac{1}{N} YY^T - \mathbb{E}\left[yy^T\right] \right\| \le \epsilon \left\| \mathbb{E}\left[yy^T\right] \right\| = \epsilon (1 + \sigma_z^2). \qquad (69)
$$

Consequently, the singular values $\sigma_i\left(\frac{1}{N} YY^T\right)$ and $\sigma_i\left(\mathbb{E}\left[yy^T\right]\right)$ differ by at most $\epsilon(1 + \sigma_z^2)$ and we can bound

and further σ 2 y,d+1 = N σ y,d+1

which concludes the proof of equations (65) and (66).

In the same manner we bound

which concludes the proof of (67).

