EARLY STOPPING FOR DEEP IMAGE PRIOR

Abstract

Deep image prior (DIP) and its variants have shown remarkable potential for solving inverse problems in computational imaging (CI), needing no separate training data. Practical DIP models are often substantially overparameterized. During the learning process, these models first learn the desired visual content and then pick up the potential modeling and observational noise, i.e., they overfit. Thus, the practicality of DIP hinges on early stopping (ES) that can capture the transition period. In this regard, most previous DIP works for CI tasks only demonstrate the potential of the models: they report the peak performance against the groundtruth but provide no clue about how to operationally obtain near-peak performance without access to the groundtruth. In this paper, we set out to break this practicality barrier of DIP, and propose an efficient ES strategy that consistently detects near-peak performance across several CI tasks and DIP variants. Simply based on the running variance of DIP intermediate reconstructions, our ES method not only outpaces the existing ones, which only work in very narrow regimes, but also remains effective when combined with methods that try to mitigate overfitting.

1. INTRODUCTION

Inverse problems (IPs) are prevalent in computational imaging (CI), ranging from basic image denoising, super-resolution, and deblurring, to advanced 3D reconstruction and major tasks in scientific and medical imaging (Szeliski, 2022). Despite the disparate settings, all these problems take the form of recovering a visual object $x$ from $y = f(x)$, where $f$ models the forward process to obtain the observation $y$. Typically, these visual IPs are underdetermined: $x$ cannot be uniquely determined from $y$. This is exacerbated by potential modeling noise (e.g., a linear $f$ approximating a nonlinear process) and observational noise (e.g., Gaussian or shot noise), i.e., $y \approx f(x)$. To overcome the nonuniqueness and improve noise stability, people often encode a variety of problem-specific priors on $x$ when formulating IPs. Traditionally, IPs are phrased as regularized data-fitting problems:

$$\min_{x} \; \ell(y, f(x)) + \lambda R(x), \qquad \ell(y, f(x)): \text{data-fitting loss}, \quad R(x): \text{regularizer}, \tag{1}$$

where $\lambda$ is the regularization parameter. Here, the loss $\ell$ is often chosen according to the noise model, and the regularizer $R$ encodes priors on $x$.

The advent of deep learning (DL) has revolutionized how IPs are solved: on the radical side, deep neural networks (DNNs) are trained to directly map any given $y$ to an $x$; on the mild side, pretrained or trainable DL models are taken to replace certain nonlinear mappings in numerical algorithms for solving Eq. (1) (e.g., plug-and-play, and algorithm unrolling). These developments, covered in the recent surveys Ongie et al. (2020) and Janai et al. (2020), trust large training sets $\{(y_i, x_i)\}$ to adequately represent the underlying priors and/or noise distributions.

This paper concerns another family of striking ideas that require no separate training data. Deep image prior (DIP) (Ulyanov et al., 2018) proposes parameterizing $x$ as $x = G_\theta(z)$, where $G_\theta$ is a trainable DNN parametrized by $\theta$ and $z$ is a trainable or frozen random seed. No separate training data other than $y$ are used!
Putting the reparametrization into Eq. (1), we obtain

$$\min_{\theta} \; \ell(y, f \circ G_\theta(z)) + \lambda R \circ G_\theta(z). \tag{2}$$

$G_\theta$ is often "overparameterized", i.e., containing substantially more parameters than the size of $x$, and "structured", e.g., consisting of convolution networks to encode structural priors in natural visual objects. The resulting optimization problem is solved via standard first-order methods for modern DL (e.g., (adaptive) gradient descent). When $x$ has multiple components with different physical meanings, one can naturally parametrize $x$ using multiple DNNs. This simple idea has led to surprisingly competitive results on numerous visual IPs, from low-level image denoising, super-resolution, inpainting (Ulyanov et al., 2018; Heckel & Hand, 2019; Liu et al., 2019) and blind deconvolution (Ren et al., 2020; Wang et al., 2019; Asim et al., 2020; Tran et al., 2021; Zhuang et al., 2022a), to mid-level image decomposition and fusion (Gandelsman et al., 2019; Ma et al., 2021), and to advanced CI problems (Darestani & Heckel, 2021; Hand et al., 2018; Williams et al., 2019; Yoo et al., 2021; Baguer et al., 2020; Cascarano et al., 2021; Hashimoto & Ote, 2021; Gong et al., 2022; Veen et al., 2018; Tayal et al., 2021; Zhuang et al., 2022b); see the survey Qayyum et al. (2021).

Figure 1: The "early-learning-then-overfitting" (ELTO) phenomenon in DIP for image denoising. The quality of the estimated image climbs to a peak first and then plunges once the noise is also picked up by the model $G_\theta(z)$.

Overfitting issue in DIP A critical detail that we have glossed over is overfitting. Since $G_\theta$ is substantially overparameterized, $G_\theta(z)$ can represent arbitrary elements in the $x$ domain. Global optimization of Eq. (2) would normally lead to $y = f(G_\theta(z))$, but $G_\theta(z)$ may not reproduce $x$, e.g., when $f$ is non-injective, or when $y \approx f(x)$ so that $G_\theta(z)$ also accounts for the modeling and observational noise.
Fortunately, DIP models and first-order optimization methods together offer a blessing: in practice, $G_\theta(z)$ has a bias toward the desired visual content and learns it much faster than it learns noise. So the reconstruction quality climbs to a peak before potential degradation due to noise; see Fig. 1. This "early-learning-then-overfitting" (ELTO) phenomenon has been repeatedly reported in prior works and is also backed by theories on simple $G_\theta$ and linear $f$ (Heckel & Soltanolkotabi, 2020b;a). The successes of DIP models claimed above are mostly conditioned on appropriate early stopping (ES) around the performance peaks.

Is ES for DIP trivial? Natural ideas for performing good ES can fail quickly. (1) Visual inspection: This subjective approach is fine for small-scale tasks involving few problem instances, but quickly becomes infeasible for many scenarios, such as (a) large-scale batch processing, (b) recovery of visual content that is tricky to visualize and/or examine by eye (e.g., 3D or 4D visual objects), and (c) scientific imaging of unfamiliar objects (e.g., MRI imaging of rare tumors, and microscopic imaging of new virus species); (2) Tracking full-reference/no-reference image quality metrics (FR/NR-IQMs): Without the groundtruth $x$, computing any FR-IQM and hence tracking their trajectories (e.g., the PSNR curve in Fig. 1) is out of the question. We consider tracking NR-IQMs as a family of baseline methods in Sec. 3.

Our contribution

We advocate the ES approach, in which the iteration process stops once a good ES point is detected, because (1) the regularization and noise modeling approaches, even if effective, often do not improve the peak performance but push it to the last iterations; they can spend ≥ 10× more iterations than the original DIP models need to climb to the peak; (2) both need deep knowledge about the noise type/level, which is practically unknown for most applications. If their key models and hyperparameters are not set appropriately, overfitting probably remains, and ES is still needed.

In this paper, we build a novel ES criterion for various DIP models simply by tracking the trend of the running variance of the reconstruction sequence. Our ES method is (1) Effective: The gap between our detected and the peak performance, i.e., the detection gap, is typically very small, as measured by standard visual quality metrics (PSNR and SSIM); (2) Efficient: Relative to the per-iteration cost of solving Eq. (2), the per-iteration overhead is a fraction for the standard version in Algorithm 1 and negligible for the variant in Algorithm 2; (3) General: Our method works well for DIP and its variants, including deep decoder (Heckel & Hand, 2019, DD) and sinusoidal representation networks (Sitzmann et al., 2020, SIREN), on different noise types/levels and across 5 visual IPs, spanning both linear and nonlinear ones. Also, our method can be wrapped around several regularization methods, e.g., Gaussian process-DIP (Cheng et al., 2019, GP-DIP) and DIP with total variation regularization (Liu et al., 2019; Cascarano et al., 2021, DIP-TV), to perform reasonable ES when they fail to prevent overfitting; (4) Robust: Our method is relatively insensitive to the two hyperparameters, i.e., window size and patience number (see Secs. 2, 3 and 3.4 and Appendix A.7.13). By contrast, the hyperparameters of most methods reviewed above are sensitive to the noise type/level.

2. OUR EARLY-STOPPING METHOD

Figure 2: Relationship between the PSNR, MSE, and VAR curves. Our method relies on the VAR curve, whose valley is often well aligned with the MSE valley, to detect the MSE valley, which corresponds to the PSNR peak.

Intuition for our method We assume: $x$ is the unknown groundtruth visual object of size $N$, $\{\theta_t\}_{t \ge 1}$ is the iterate sequence, and $\{x_t\}_{t \ge 1}$ is the reconstruction sequence, where $x_t \doteq G_{\theta_t}(z)$. Since we do not know $x$, we cannot access the PSNR or any FR-IQM curve. But we observe (Fig. 2) that the MSE (resp. PSNR; recall $\mathrm{PSNR}(x_t) = 10 \log_{10}(\|x\|_\infty^2 / \mathrm{MSE}(x_t))$) curve generally follows a U (resp. bell) shape: $\|x_t - x\|_F^2$ initially drops quickly to a low level, and then climbs back due to the noise effect, i.e., the ELTO phenomenon in Sec. 1; we hope to detect the valley of this U-shaped MSE curve.

Then how to gauge the MSE curve without knowing $x$? We consider the running variance (VAR):

$$\mathrm{VAR}(t) \doteq \frac{1}{W} \sum_{w=0}^{W-1} \Big\| x_{t+w} - \frac{1}{W} \sum_{i=0}^{W-1} x_{t+i} \Big\|_F^2.$$

Initially, the models quickly learn the desired visual content, resulting in a monotonic, rapidly decreasing MSE curve (see Fig. 2). So we expect the running variance of $\{x_t\}_{t \ge 1}$ to also drop quickly, as shown in Fig. 2. When the iteration is near the MSE valley, all the $x_t$'s are near but scattered around $x$. So $\frac{1}{W} \sum_{i=0}^{W-1} x_{t+i} \approx x$ and $\mathrm{VAR}(t) \approx \frac{1}{W} \sum_{w=0}^{W-1} \|x_{t+w} - x\|_F^2$. Afterward, the noise effect kicks in and the MSE curve bounces back, leading to a similar bounce-back in the VAR curve as the $x_t$ sequence gradually moves away from $x$.

[Algorithm 1: listing incomplete; surviving tail: "k = k + 1", "14: end while"]

Detecting transition by running variance Our lightweight method only involves computing the VAR curve and numerically detecting its valley: the iteration stops once the valley is detected. To obtain the curve, we set a window-size parameter $W$ and compute the windowed moving variance (WMV). To robustly detect the valley, we introduce a patience number $P$ to tolerate up to $P$ consecutive steps of variance stagnation.
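The valley detection just described can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a reproduction of Algorithm 1: the class name `WMVEarlyStopper`, the `update` method, the representation of each reconstruction as a flat list of floats, and the choice to return the first element of the minimum-variance window (matching the indexing of VAR(t) by the window start) are all assumptions made here.

```python
from collections import deque
import math


class WMVEarlyStopper:
    """Track the windowed moving variance (WMV) of a reconstruction sequence."""

    def __init__(self, window_size=100, patience=1000):
        self.window_size = window_size             # W in the text
        self.patience = patience                   # P in the text
        self.window = deque(maxlen=window_size)    # last W reconstructions
        self.var_min = math.inf
        self.best = None       # reconstruction at the detected valley (assumption)
        self.stagnation = 0    # consecutive steps without a new minimum

    def _variance(self):
        # VAR = 1/W * sum_w ||x_w - mean||^2 over the current window.
        n = len(self.window[0])
        mean = [sum(x[i] for x in self.window) / self.window_size for i in range(n)]
        total = sum((x[i] - mean[i]) ** 2 for x in self.window for i in range(n))
        return total / self.window_size

    def update(self, x):
        """Feed one reconstruction; return True once the valley is detected."""
        self.window.append(list(x))
        if len(self.window) < self.window_size:
            return False                           # window not full yet
        var = self._variance()
        if var < self.var_min:
            self.var_min = var
            self.best = self.window[0]
            self.stagnation = 0
        else:
            self.stagnation += 1
        return self.stagnation >= self.patience
```

In a DIP loop one would call `update` on the flattened reconstruction after every optimizer step and break when it returns True. The pure-Python lists keep the sketch dependency-free; an array library would be the natural choice for real images.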
Obviously, the cost is dominated by the variance calculation per step, which is $O(WN)$ ($N$ is the size of the visual object). In comparison, a typical gradient update step for solving Eq. (2) costs at least $\Omega(|\theta| N)$, where $|\theta|$ is the number of parameters in the DNN $G_\theta$. Since $|\theta|$ is typically much larger than $W$ (default: 100), our running VAR and detection incur very little computational overhead. Our whole algorithmic pipeline is summarized in Algorithm 1. To confirm the effectiveness, we provide sample qualitative results in Figs. 3 and 11, with more quantitative results included in the experiment part (Sec. 3; see also Tab. 1). Appendix A.7.3 shows that, on image denoising with different noise types/levels, our ES method can detect near-peak ES points. Similarly, our method remains effective on several popular DIP variants, as shown in Fig. 3.

Seemingly similar ideas Our running variance and its U-shaped curve are reminiscent of the classical U-shaped bias-variance tradeoff curve and hence validation-based ES (Geman et al., 1992; Yang et al., 2020). But there are crucial differences: (1) our learning setting is not supervised; (2) the variance in supervised learning is with respect to the sample distribution, whereas our variance here pertains to the $\{x_t\}_{t \ge 1}$ sequence. As discussed in Sec. 1, we cannot directly apply validation-based ES, although it is possible to heuristically emulate it by splitting the elements in $y$ (Yaman et al., 2021; Ding et al., 2022), which might be problematic for nonlinear IPs. Another line of related ideas is variance-based online change-point detection in time series analysis (Aminikhanghahi & Cook, 2017), where running variance is often used to detect mean-shift assuming the means are piecewise constant. Here, the piecewise constancy assumption does not hold for our $\{x_t\}_{t \ge 1}$.

[…] 2020b)) in understanding DNNs.
The idea is based on the assumption that during DNN training $\theta$ does not move much away from the initialization $\theta_0$, so that the learning dynamic can be approximated by that of a linearized model, i.e., suppose that we take the MSE loss

$$\|y - G_\theta(z)\|_2^2 \approx \|y - G_{\theta_0}(z) - J_G(\theta_0)(\theta - \theta_0)\|_2^2 \doteq \widehat{f}(\theta),$$

where $J_G(\theta_0)$ is the Jacobian of $G$ with respect to $\theta$ at $\theta_0$, and $G_{\theta_0}(z) + J_G(\theta_0)(\theta - \theta_0)$ is the first-order Taylor approximation to $G_\theta(z)$ around $\theta_0$. $\widehat{f}(\theta)$ is simply a least-squares objective. We can directly calculate the running variance based on the linear model, as shown below.

Theorem 2.1. Let $\sigma_i$'s and $w_i$'s be the singular values and left singular vectors of $J_G(\theta_0)$, and suppose we run gradient descent with step size $\eta$ on the linearized objective $\widehat{f}(\theta)$ to obtain $\{\theta_t\}$ and $\{x_t\}$ with $x_t \doteq G_{\theta_0}(z) + J_G(\theta_0)(\theta_t - \theta_0)$. Then provided that $\eta \le 1/\max_i(\sigma_i^2)$,

$$\mathrm{VAR}(t) = \sum_i C_{W,\eta,\sigma_i} \langle w_i, \widehat{y} \rangle^2 (1 - \eta\sigma_i^2)^{2t},$$

where $\widehat{y} = y - G_{\theta_0}(z)$, and $C_{W,\eta,\sigma_i} \ge 0$ only depends on $W$, $\eta$, and $\sigma_i$ for all $i$.

The proof can be found in Appendix A.2. Theorem 2.1 shows that if the learning rate (LR) $\eta$ is sufficiently small, the WMV of $\{x_t\}$ is monotonically decreasing. We can develop a complementary upper bound for the WMV that does have a U shape. To this end, we make use of Theorem 2 of Heckel & Soltanolkotabi (2020b), which can be summarized (some technical details omitted; precise statement reproduced in Appendix A.3) as follows: consider the two-layer model $G_C(B) = \mathrm{ReLU}(UBC)v$, where $C \in \mathbb{R}^{n \times k}$ models $1 \times 1$ trainable convolutions, $v \in \mathbb{R}^{k \times 1}$ contains fixed weights, $U$ is an upsampling operation, and $B$ is the fixed random seed. Let $J$ be a reference Jacobian matrix solely determined by the upsampling operation $U$, and $\sigma_i$'s and $w_i$'s the singular values and left singular vectors of $J$. Assume $x \in \mathrm{span}\{w_1, \dots, w_p\}$.
Then, when $\eta$ is sufficiently small, with high probability,

$$\|G_{C_t}(B) - x\|_2 \le (1 - \eta\sigma_p^2)^t \|x\|_2 + E(n) + \varepsilon \|y\|_2,$$

where $\varepsilon > 0$ is a small scalar related to the structure of the network and $E(n)$ is the error introduced by noise: $E^2(n) \doteq \sum_{j=1}^{n} ((1 - \eta\sigma_j^2)^t - 1)^2 \langle w_j, n \rangle^2$. So if the gap $\sigma_p/\sigma_{p+1} > 1$, $\|G_{C_t}(B) - x\|_2$ is dominated by $(1 - \eta\sigma_p^2)^t \|x\|_2$ when $t$ is small, and then by $E(n)$ when $t$ is large. Since the former decreases and the latter increases as $t$ grows, the upper bound has a U shape with respect to $t$. Based on this result, we have:

Theorem 2.2. Assume the same setting as Theorem 2 of Heckel & Soltanolkotabi (2020b). With high probability, our WMV is upper bounded by

$$\frac{12}{W} \|x\|_2^2 \frac{(1 - \eta\sigma_p^2)^{2t}}{1 - (1 - \eta\sigma_p^2)^2} + 12 \sum_{i=1}^{n} \big((1 - \eta\sigma_i^2)^{t+W-1} - 1\big)^2 (w_i^\top n)^2 + 12 \varepsilon^2 \|y\|_2^2.$$

Figure 4: The exact and upper bounds predicted by Theorems 2.1 and 2.2.

The exact statement and proof can be found in Appendix A.3. By similar reasoning as above, we can conclude that the upper bound in Theorem 2.2 also has a U shape. To interpret the results, Fig. 4 shows the curves (as functions of $t$) predicted by Theorems 2.1 and 2.2. The actual VAR curve should lie between the two curves. These results are primitive and limited, similar to the situation for many DL theories that provide untight upper and lower bounds; we leave a complete theoretical justification as future work.

A memory-efficient variant While Algorithm 1 is already lightweight and effective in practice, we can slightly modify it to avoid maintaining $Q$ and hence save memory. The trick is to use the exponential moving variance (EMV), together with the exponential moving average (EMA), shown in Appendix A.4. The hard window size parameter $W$ is now replaced by the soft forgetting factor $\alpha$: the larger the $\alpha$, the smaller the impact of the history, and hence the smaller the effective window. We compare ES-WMV with ES-EMV in Appendix A.7.11 systematically for image denoising tasks.
The latter has slightly better detection due to the strong smoothing effect (α = 0.1). For this paper, we prefer to remain simple and leave systematic evaluations of ES-EMV on other IPs as future work.
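The EMA/EMV recursion behind this variant can be sketched as follows. This is a minimal illustration that follows the update rules and the zero initialization stated for Algorithm 2 in Appendix A.4. The class name `EMVEarlyStopper`, the `update` method, and the flat-list representation of reconstructions are assumptions made here, and the stopping rule (stop after P consecutive steps without a new minimum) is one reading of "stagnates for P iterations".

```python
import math


class EMVEarlyStopper:
    """Early stopping from an exponential moving variance (EMV)."""

    def __init__(self, alpha=0.1, patience=1000):
        self.alpha = alpha          # forgetting factor in (0, 1)
        self.patience = patience    # P in the text
        self.ema = None             # exponential moving average (vector)
        self.emv = 0.0              # exponential moving variance (scalar), EMV_0 = 0
        self.emv_min = math.inf
        self.best = None
        self.stagnation = 0

    def update(self, x):
        """Feed one reconstruction; return True once stagnation reaches P."""
        x = list(x)
        if self.ema is None:
            self.ema = [0.0] * len(x)    # EMA_0 = 0
        a = self.alpha
        # The variance update uses the previous average EMA_k, so compute it first.
        sq_dist = sum((xi - mi) ** 2 for xi, mi in zip(x, self.ema))
        self.emv = (1 - a) * self.emv + a * (1 - a) * sq_dist
        self.ema = [(1 - a) * mi + a * xi for xi, mi in zip(x, self.ema)]
        if self.emv < self.emv_min:
            self.emv_min, self.best, self.stagnation = self.emv, x, 0
        else:
            self.stagnation += 1
        return self.stagnation >= self.patience
```

Only one running average and one scalar are stored, so the memory cost no longer grows with a window size, which is the point of the variant.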

3. EXPERIMENTS

Figure 5: Baseline ES vs our ES-WMV on denoising with low-level noise. For NIMA, we report both technical quality assessment (NIMA-q) and aesthetic assessment (NIMA-a). Smaller PSNR gaps are better.

We test ES-WMV for DIP on image denoising, inpainting, super-resolution, MRI reconstruction, and blind image deblurring, spanning both linear and nonlinear IPs. For image denoising, we also systematically evaluate ES-WMV on major variants of DIP, including DD (Heckel & Hand, 2019), DIP-TV (Cascarano et al., 2021), and GP-DIP (Cheng et al., 2019), and demonstrate ES-WMV as a reliable helper to detect good ES points. Details of the DIP variants are discussed in Appendix A.5. We also compare ES-WMV with major competing methods, including DF-STE (Jo et al., 2021), SV-ES (Li et al., 2021), DOP (You et al., 2020), SB (Shi et al., 2022), and VAL (Yaman et al., 2021; Ding et al., 2022). Details of major ES-based methods can be found in Appendix A.6. We use both PSNR and SSIM to assess the reconstruction quality, and we report PSNR and SSIM gaps (the difference between our detected and peak numbers) as indicators of our detection performance. Common acronyms, pointers to external codes, detailed experiment settings, and results on real-world denoising, inpainting, and super-resolution are in Appendices A.1, A.7.1, A.7.2, A.7.7, A.7.9 and A.7.10, respectively.

3.1. IMAGE DENOISING

Prior works dealing with DIP overfitting mostly focus on image denoising, but typically only evaluate their methods on one or two kinds of noise with low noise levels, e.g., low-level Gaussian noise. To stretch our evaluation, we consider 4 types of noise: Gaussian, shot, impulse, and speckle. We take the classical 9-image dataset (Dabov et al., 2008) , and for each noise type, generate two noise levels, low and high, i.e., level 2 and 4 of Hendrycks & Dietterich (2019) , respectively. See also the performance of our ES-WMV on real-world denoising in Tab. 1 and Appendix A.7.7.

Comparison with baseline ES methods

It is natural to expect that NR-IQMs, such as the classical BRISQUE (Mittal et al., 2012) , NIQE (Mittal et al., 2013) , and modern DNN-based NIMA (Esfandarani & Milanfar, 2018) can possibly make good ES criteria. We thus set up 3 baseline methods using BRISQUE, NIQE, and NIMA, respectively and seek the optimal x t by these metrics. Fig. 5 presents the comparison (in terms of PSNR gaps) of these 3 methods with our ES-WMV on denoising with low-level noise; results on high-level noise, and measured by SSIM are included in Appendix A.7.4. While our method enjoys favorable detection gaps (≤ 2) for most tested noise types/levels (except for Baboon, Kodak1, Kodak2 for certain noise types/levels; DIP itself is suboptimal in terms of denoising such images with substantial high-frequency components), detection gaps by the baseline methods can get huge (≥ 10). We report the results of SV-ES in Appendix A.7.5 since ES-WMV performs largely comparably to SV-ES. However, ES-WMV is much faster in wall-clock time, as reported in Tab. 2: for each epoch, the overhead of our ES-WMV is less than 3/4 of the DIP update itself, while SV-ES is around 25× of that. There is no surprise: while our method only needs to update the running variance of the {x t } t≥1 each time, SV-ES needs to train a coupled autoencoder which is extremely expensive. DOP is designed specifically just for impulse noise, so we compare ES-WMV with DOP on impulse noise (see Appendix A.7.5). The loss is changed to ℓ 1 to account for the sparse noise. In terms of the final PSNRs, DOP outperforms DIP with ES-WMV by a small gap, but even the peak PSNR of DIP with ℓ 1 lags behind DOP by about 2dB for high noise levels. The ES method in SB is acknowledged to fail for vanilla DIP. Moreover, their modified model still suffers from the overfitting issue beyond the very low noise levels, as shown in Fig. 20 . Their ES method fails to stop at appropriate places when the noise level is high. 
Hence, we test both ES-WMV and SB on their modified DIP model in (Shi et al., 2022) , based on two datasets they test: the classic 9-image dataset (Dabov et al., 2008) and CBSD68 dataset (Martin et al., 2001) . Qualitative results on the 9 images are shown in Appendix A.7.5; detected PSNR and stopping epochs on the CBSD68 dataset are reported in Tab. 3. For SB, the detection threshold parameter is fixed at 0.01. It is evident that both methods have similar detection performance for low noise levels but ES-WMV outperforms SB when the noise level is high. Also, ES-WMV tends to stop much earlier than SB, saving computational cost. We compare VAL with our ES-WMV on the 9-image dataset with low/high-level Gaussian and impulse noise. Since Ding et al. (2022) takes 90% pixels to train DIP and that usually decreases the peak performance, we report the final PSNRs detected by both methods (See Fig. 7 ). The two ES methods perform very comparably in image denoising, which is probably due to the mild violation of the iid assumption only, and also relatively low-degree information loss due to data splitting. The more complex nonlinear BID in Sec. 3.3 reveals their gap. ES-WMV as a helper for DIP variants DD, DIP+TV, GP-DIP represent different regularization strategies for controlling overfitting. A critical issue, however, is setting the right hyperparameters for them so that overfitting is removed while peaklevel performance is preserved. So practically, these methods are not free from overfitting, especially when the noise level is high. Thus, instead of treating them as competitors, we test if ES-WMV can reliably detect good ES points for them. We focus on Gaussian denoising, and report the results in Fig. 8 (a)-(c) and Appendix A.7.6. ES-WMV is able to attain ≤ 1 PNSR gap for most of the cases, with few outliers. These regularizations typically change the recovery trajectory. We suspect that finetuning of our method may improve on these corner cases. 
Here, we test SIREN, which is reviewed in Appendix A.5, as a replacement of DIP models for Gaussian denoising, and summarize the results in Fig. 8 and Fig. 21. ES-WMV is again able to detect near-peak performance for most images.

3.3. BLIND IMAGE DEBLURRING (BID)

In BID, a blurry and noisy image is given, and the goal is to recover a sharp and clean image. The blur is mostly caused by motion and/or optical nonideality in the camera, and the forward process […] $G_{\theta_k}(z_k) * G_{\theta_x}(z_x)\|_2^2 + \lambda \|\nabla G_{\theta_x}(z_x)\|_1 / \|\nabla G_{\theta_x}(z_x)\|_2$, where the regularizer is to promote sparsity in the gradient domain for reconstruction of $x$, as standard in BID. We follow Ren et al. (2020) and choose a multi-layer perceptron (MLP) with softmax activation for $G_{\theta_k}$, and the canonical DIP model (CNN-based encoder-decoder architecture) for $G_{\theta_x}(z_x)$. We change their regularizer from the original $\|\nabla G_{\theta_x}(z_x)\|_1$ to the current one, as their original formulation is tested only on a very low noise level $\sigma = 10^{-5}$ and no overfitting is observed. We set out to work with a higher noise level $\sigma = 10^{-3}$, and find that their original formulation does not work. The positive effect of the modified regularizer on BID is discussed in Krishnan et al. (2011).

We systematically compare VAL and our ES-WMV on this difficult nonlinear IP, as we suspect that nonlinearity can break VAL down as discussed in Sec. 1, and subsampling the observation $y$ for training-validation splitting may be unwise. Our results (Fig. 9 (bottom left/right)) confirm these predictions: the peak performance is much worse after 10% of the elements in $y$ are removed for validation. In contrast, our ES-WMV returns quantitatively near-peak performance, far better than leaving the process to overfit. In Appendix A.7.12, we test both low- and high-level noise on the entire Levin dataset for completeness.

The window size $W$ (default: 100) and patience number $P$ (default: 1000) are the only hyperparameters for ES-WMV. To study their impact on ES detection, we vary them across a range and check how the detection gap changes for Gaussian denoising on the classic 9-image dataset (Dabov et al., 2008) with medium-level noise, as shown in Fig. 10 for PSNR gaps and Fig. 26 for SSIM gaps.
Our method is robust against these changes, and it seems larger W and P can bring in marginal improvement.

4. DISCUSSION

We have proposed a simple yet effective ES detection method (ES-WMV, and the ES-EMV variant) that works robustly across multiple visual IPs and DIP variants. In comparison, competing ES methods are noise- or DIP-model-specific, and only work for limited scenarios; Li et al. (2021) has comparable performance but slows down the running speed too much; validation-based ES (Ding et al., 2022) works well for the simple denoising task but lags far behind our ES method in nonlinear IPs, e.g., BID. As for limitations, our theoretical justification is only partial, sharing the same difficulty of analyzing DNNs in general. Our ES method struggles with images with substantial high-frequency components; DIP needs to run numerous iterative steps for every instance, which is not ideal for time-constrained applications.

Proof. To simplify the notation, we write $\widehat{y} \doteq y - G_{\theta_0}(z)$, $J \doteq J_G(\theta_0)$, and $c \doteq \theta - \theta_0$. So the least-squares objective in Eq. (4) is equivalent to

A APPENDIX

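$$\|\widehat{y} - Jc\|_2^2 \tag{8}$$

and the gradient update reads

$$c_t = c_{t-1} - \eta J^\top (J c_{t-1} - \widehat{y}), \tag{9}$$

where $c_0 = 0$ and $x_t = J c_t + G_{\theta_0}(z)$. The residual at time $t$ can be computed as

$$r_t \doteq \widehat{y} - J c_t \tag{10}$$
$$= \widehat{y} - J\big(c_{t-1} - \eta J^\top (J c_{t-1} - \widehat{y})\big) \tag{11}$$
$$= (I - \eta J J^\top)(\widehat{y} - J c_{t-1}) \tag{12}$$
$$= (I - \eta J J^\top)^2 (\widehat{y} - J c_{t-2}) = \dots \tag{13}$$
$$= (I - \eta J J^\top)^t (\widehat{y} - J c_0) \quad (\text{using } c_0 = 0) \tag{14}$$
$$= (I - \eta J J^\top)^t \widehat{y}. \tag{15}$$

Assume the SVD of $J$ as $J = W \Sigma V^\top$. Then

$$r_t = (I - \eta W \Sigma^2 W^\top)^t \widehat{y} = \sum_i (1 - \eta\sigma_i^2)^t \, w_i^\top \widehat{y} \, w_i \tag{16}$$

and so

$$J c_t = \widehat{y} - r_t = \sum_i \big(1 - (1 - \eta\sigma_i^2)^t\big) \, w_i^\top \widehat{y} \, w_i. \tag{17}$$

Consider a set of $W$ vectors $\mathcal{V} = \{v_1, \dots, v_W\}$. We have that the empirical variance

$$\mathrm{VAR}(\mathcal{V}) = \frac{1}{W} \sum_{w=1}^{W} \Big\| v_w - \frac{1}{W} \sum_{j=1}^{W} v_j \Big\|_2^2 = \frac{1}{W} \sum_{w=1}^{W} \|v_w\|_2^2 - \Big\| \frac{1}{W} \sum_{w=1}^{W} v_w \Big\|_2^2. \tag{18}$$

So the variance of the set $x_t, x_{t+1}, \dots, x_{t+W-1}$, same as the variance of the set $J c_t, J c_{t+1}, \dots, J c_{t+W-1}$, can be calculated as

$$\frac{1}{W} \sum_{w=0}^{W-1} \sum_i (w_i^\top \widehat{y})^2 \big(1 - (1 - \eta\sigma_i^2)^{t+w}\big)^2 - \frac{1}{W^2} \sum_i (w_i^\top \widehat{y})^2 \Big( \sum_{w=0}^{W-1} \big(1 - (1 - \eta\sigma_i^2)^{t+w}\big) \Big)^2 \tag{19}$$

$$= \frac{1}{W^2} \sum_i (w_i^\top \widehat{y})^2 \Big[ W \sum_{w=0}^{W-1} \big(1 - (1 - \eta\sigma_i^2)^{t+w}\big)^2 - \Big( \sum_{w=0}^{W-1} \big(1 - (1 - \eta\sigma_i^2)^{t+w}\big) \Big)^2 \Big] \tag{20}$$

$$= \frac{1}{W^2} \sum_i (w_i^\top \widehat{y})^2 \Big[ W^2 + \frac{W (1 - \eta\sigma_i^2)^{2t} \big(1 - (1 - \eta\sigma_i^2)^{2W}\big)}{1 - (1 - \eta\sigma_i^2)^2} - \frac{2W (1 - \eta\sigma_i^2)^t \big(1 - (1 - \eta\sigma_i^2)^W\big)}{\eta\sigma_i^2} - \Big( W^2 - \frac{2W (1 - \eta\sigma_i^2)^t \big(1 - (1 - \eta\sigma_i^2)^W\big)}{\eta\sigma_i^2} + \frac{(1 - \eta\sigma_i^2)^{2t} \big(1 - (1 - \eta\sigma_i^2)^W\big)^2}{\eta^2\sigma_i^4} \Big) \Big] \tag{21}$$

$$= \frac{1}{W^2} \sum_i \langle w_i, \widehat{y} \rangle^2 \frac{(1 - \eta\sigma_i^2)^{2t}}{\eta\sigma_i^2} \Big[ \frac{W \big(1 - (1 - \eta\sigma_i^2)^{2W}\big)}{2 - \eta\sigma_i^2} - \frac{\big(1 - (1 - \eta\sigma_i^2)^W\big)^2}{\eta\sigma_i^2} \Big]. \tag{22}$$

So the constants $C_{W,\eta,\sigma_i}$'s are defined as

$$C_{W,\eta,\sigma_i} \doteq \frac{1}{W^2 \eta\sigma_i^2} \Big[ \frac{W \big(1 - (1 - \eta\sigma_i^2)^{2W}\big)}{2 - \eta\sigma_i^2} - \frac{\big(1 - (1 - \eta\sigma_i^2)^W\big)^2}{\eta\sigma_i^2} \Big]. \tag{23}$$

To see they are nonnegative, it is sufficient to show that

$$\frac{W \big(1 - (1 - \eta\sigma_i^2)^{2W}\big)}{2 - \eta\sigma_i^2} - \frac{\big(1 - (1 - \eta\sigma_i^2)^W\big)^2}{\eta\sigma_i^2} \ge 0 \iff \eta\sigma_i^2 W \big(1 - (1 - \eta\sigma_i^2)^{2W}\big) - (2 - \eta\sigma_i^2) \big(1 - (1 - \eta\sigma_i^2)^W\big)^2 \ge 0. \tag{24}$$

Now consider the function

$$h(\xi, W) = \xi W \big(1 - (1 - \xi)^{2W}\big) - (2 - \xi) \big(1 - (1 - \xi)^W\big)^2, \qquad \xi \in [0, 1], \; W \ge 1.$$

First, one can easily check that $\partial_W h(\xi, W) \ge 0$ for all $W \ge 1$ and all $\xi \in [0, 1]$, i.e., $h(\xi, W)$ is monotonically increasing with respect to $W$.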
Thus, in order to prove $C_{W,\eta,\sigma_i} \ge 0$, it suffices to show that $h(\xi, 1) \ge 0$. Now

$$h(\xi, 1) = \xi \big(1 - (1 - \xi)^2\big) - (2 - \xi)\xi^2 = 0,$$

completing the proof.
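As a sanity check of the closed form, the sketch below compares it with a direct simulation in the special case of a diagonal $J$, where the singular values are the diagonal entries and the left singular vectors are the standard basis vectors. This is an illustration under that simplifying assumption, not part of the proof; the function names `closed_form_var` and `simulated_var` and the numbers in the usage note are hypothetical.

```python
def closed_form_var(sigmas, y_hat, eta, W, t):
    """VAR(t) = sum_i C_i * <w_i, y_hat>^2 * (1 - eta*sigma_i^2)^(2t), diagonal J."""
    total = 0.0
    for s, y in zip(sigmas, y_hat):
        xi = eta * s * s
        a = 1.0 - xi
        # Constant C_{W,eta,sigma_i} as defined in Eq. (23).
        c = (W * (1 - a ** (2 * W)) / (2 - xi) - (1 - a ** W) ** 2 / xi) / (W * W * xi)
        total += c * y * y * a ** (2 * t)
    return total


def simulated_var(sigmas, y_hat, eta, W, t):
    """Run gradient descent on ||y_hat - J c||^2 with J = diag(sigmas), c_0 = 0,
    then compute the empirical variance of x_t, ..., x_{t+W-1}."""
    c = [0.0] * len(sigmas)
    xs = []
    for _ in range(t + W):
        xs.append([s * ci for s, ci in zip(sigmas, c)])    # x_k = J c_k
        c = [ci - eta * s * (s * ci - y) for s, ci, y in zip(sigmas, c, y_hat)]
    window = xs[t:t + W]
    mean = [sum(col) / W for col in zip(*window)]
    return sum((v - m) ** 2 for x in window for v, m in zip(x, mean)) / W
```

For example, with `sigmas = [2.0, 1.0, 0.5]`, `y_hat = [1.0, -2.0, 0.5]`, `eta = 0.2` (which satisfies $\eta \le 1/\max_i \sigma_i^2$) and `W = 5`, the two functions agree up to floating-point error, and the closed form decreases as `t` grows.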

A.3 PROOF OF THEOREM 2.2

We first restate Theorem 2 in Heckel & Soltanolkotabi (2020b).

Theorem A.1 (Heckel & Soltanolkotabi (2020b)). Let $x \in \mathbb{R}^n$ be a signal in the span of the first $p$ trigonometric basis functions, and consider a noisy observation $y = x + n$, where the noise $n \sim \mathcal{N}(0, \xi^2/n \cdot I)$. To denoise this signal, we fit a two-layer generator network $G_C(B) = \mathrm{ReLU}(UBC)v$, where $v = [1, \dots, 1, -1, \dots, -1]/\sqrt{k}$, $B \sim_{\mathrm{iid}} \mathcal{N}(0, 1)$, and $U$ is an upsampling operator that implements circular convolution with a given kernel $u$. Denote $\sigma \doteq \|u\|_2 \, |F g(u \circledast u / \|u\|_2^2)|^{1/2}$, where $g(t) = (1 - \cos^{-1}(t)/\pi) t$ and $\circledast$ denotes the circular convolution. Fix any $\varepsilon \in (0, \sigma_p/\sigma_1]$, and suppose $k \ge C_u n / \varepsilon^8$, where $C_u > 0$ is a constant only depending on $u$. Consider gradient descent with step size $\eta \le \|F u\|_\infty^{-2}$ ($F u$ is the Fourier transform of $u$) starting from $C_0$ with iid $\mathcal{N}(0, \omega^2)$ entries, $\omega \propto \|y\|_2 / \sqrt{n}$. Then, for all iterates $t$ obeying $t \le \frac{100}{\eta\sigma_p^2}$, the reconstruction error obeys

$$\|G_{C_t}(B) - x\|_2 \le (1 - \eta\sigma_p^2)^t \|x\|_2 + \sqrt{\sum_{i=1}^{n} \big((1 - \eta\sigma_i^2)^t - 1\big)^2 (w_i^\top n)^2} + \varepsilon \|y\|_2$$

with probability at least $1 - \exp(-k^2) - n^{-2}$.

Note that since $B \sim_{\mathrm{iid}} \mathcal{N}(0, 1)$ and hence is full-rank with probability one, the original Theorem […] $\eta \le \|F u\|_\infty^{-2}$ ($F u$ is the Fourier transform of $u$) starting from $C_0$ with iid $\mathcal{N}(0, \omega^2)$ entries, $\omega \propto \|y\|_2 / \sqrt{n}$. Then, for all iterates $t$ obeying $t \le \frac{100}{\eta\sigma_p^2}$, our WMV obeys

$$\mathrm{WMV} \le \frac{12}{W} \|x\|_2^2 \frac{(1 - \eta\sigma_p^2)^{2t}}{1 - (1 - \eta\sigma_p^2)^2} + 12 \sum_{i=1}^{n} \big((1 - \eta\sigma_i^2)^{t+W-1} - 1\big)^2 (w_i^\top n)^2 + 12 \varepsilon^2 \|y\|_2^2 \tag{27}$$

with probability at least $1 - \exp(-k^2) - n^{-2}$.

Proof. We make use of the basic inequality: $\|a - b\|_2^2 \le 2\|a\|_2^2 + 2\|b\|_2^2$ for any two vectors $a, b$ of compatible dimension.
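We have

$$\frac{1}{W} \sum_{w=0}^{W-1} \Big\| G_{C_{t+w}}(B) - \frac{1}{W} \sum_{j=0}^{W-1} G_{C_{t+j}}(B) \Big\|_2^2 \tag{28}$$

$$= \frac{1}{W} \sum_{w=0}^{W-1} \Big\| G_{C_{t+w}}(B) - x + x - \frac{1}{W} \sum_{j=0}^{W-1} G_{C_{t+j}}(B) \Big\|_2^2 \tag{29}$$

$$\le \frac{2}{W} \sum_{w=0}^{W-1} \|G_{C_{t+w}}(B) - x\|_2^2 + 2 \Big\| x - \frac{1}{W} \sum_{j=0}^{W-1} G_{C_{t+j}}(B) \Big\|_2^2 \tag{30}$$

$$\le \frac{2}{W} \sum_{w=0}^{W-1} \|G_{C_{t+w}}(B) - x\|_2^2 + \frac{2}{W} \sum_{j=0}^{W-1} \|G_{C_{t+j}}(B) - x\|_2^2 \quad (z \mapsto \|z - x\|_2^2 \text{ convex and Jensen's inequality}) \tag{31}$$

$$= \frac{4}{W} \sum_{w=0}^{W-1} \|G_{C_{t+w}}(B) - x\|_2^2. \tag{32}$$

In view of Theorem A.1,

$$\|G_{C_{t+w}}(B) - x\|_2^2 \le 3 (1 - \eta\sigma_p^2)^{2t+2w} \|x\|_2^2 + 3 \sum_{i=1}^{n} \big((1 - \eta\sigma_i^2)^{t+w} - 1\big)^2 (w_i^\top n)^2 + 3 \varepsilon^2 \|y\|_2^2. \tag{33}$$

Thus,

$$\sum_{w=0}^{W-1} \|G_{C_{t+w}}(B) - x\|_2^2 \le 3 \|x\|_2^2 \sum_{w=0}^{W-1} (1 - \eta\sigma_p^2)^{2t+2w} + 3 \sum_{w=0}^{W-1} \sum_{i=1}^{n} \big((1 - \eta\sigma_i^2)^{t+w} - 1\big)^2 (w_i^\top n)^2 + 3 W \varepsilon^2 \|y\|_2^2 \tag{34}$$

$$\le 3 \|x\|_2^2 \frac{(1 - \eta\sigma_p^2)^{2t} \big(1 - (1 - \eta\sigma_p^2)^{2W}\big)}{1 - (1 - \eta\sigma_p^2)^2} + 3 W \sum_{i=1}^{n} \big((1 - \eta\sigma_i^2)^{t+W-1} - 1\big)^2 (w_i^\top n)^2 + 3 W \varepsilon^2 \|y\|_2^2 \tag{35}$$

$$\le 3 \|x\|_2^2 \frac{(1 - \eta\sigma_p^2)^{2t}}{1 - (1 - \eta\sigma_p^2)^2} + 3 W \sum_{i=1}^{n} \big((1 - \eta\sigma_i^2)^{t+W-1} - 1\big)^2 (w_i^\top n)^2 + 3 W \varepsilon^2 \|y\|_2^2,$$

completing the proof.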
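To make the U shape of the bound in Eq. (27) concrete, the sketch below evaluates its two $t$-dependent terms separately; the constant term $12\varepsilon^2\|y\|_2^2$ is omitted because it does not depend on $t$. The function name `wmv_upper_bound_terms`, its argument names, and the toy spectrum in the usage note are assumptions made for illustration.

```python
def wmv_upper_bound_terms(t, W, eta, sigmas, p, x_norm_sq, noise_coeffs):
    """Return (signal_term, noise_term) of the WMV upper bound in Eq. (27).

    sigmas       : singular values sigma_1 >= ... >= sigma_n (toy values)
    p            : index such that x lies in span{w_1, ..., w_p} (1-based)
    x_norm_sq    : ||x||_2^2
    noise_coeffs : the projections w_i^T n of the noise on each w_i
    """
    a_p = 1.0 - eta * sigmas[p - 1] ** 2
    # Decreasing in t: contribution of the not-yet-fitted signal.
    signal = 12.0 / W * x_norm_sq * a_p ** (2 * t) / (1.0 - a_p ** 2)
    # Nondecreasing in t: contribution of the noise that is gradually fitted.
    noise = 12.0 * sum(
        ((1.0 - eta * s * s) ** (t + W - 1) - 1.0) ** 2 * c * c
        for s, c in zip(sigmas, noise_coeffs)
    )
    return signal, noise
```

With, say, `sigmas = [1.0, 0.8, 0.3, 0.1]`, `p = 2`, `eta = 0.5`, `W = 10`, unit signal energy and noise projections of 0.01, the sum of the two terms first falls and then rises again as `t` increases, which is the U shape discussed after Theorem 2.2.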

A.4 ES-EMV ALGORITHM

The exponential moving variance version of our method is summarized in Algorithm 2.

Algorithm 2 DIP with ES-EMV
Input: random seed $z$, randomly-initialized $G_\theta$, forgetting factor $\alpha \in (0, 1)$, patience number $P$, iteration counter $k = 0$, $\mathrm{EMA}_0 = 0$, $\mathrm{EMV}_0 = 0$, $\mathrm{EMV}_{\min} = \infty$
Output: reconstruction $x^*$
1: while not stopped do
2:   update $\theta$ via Eq. (2) to obtain $\theta_{k+1}$ and $x_{k+1}$
3:   $\mathrm{EMA}_{k+1} = (1 - \alpha)\,\mathrm{EMA}_k + \alpha x_{k+1}$
4:   $\mathrm{EMV}_{k+1} = (1 - \alpha)\,\mathrm{EMV}_k + \alpha(1 - \alpha)\|x_{k+1} - \mathrm{EMA}_k\|_2^2$
5:   if $\mathrm{EMV}_{k+1} < \mathrm{EMV}_{\min}$ then
6:     $\mathrm{EMV}_{\min} \leftarrow \mathrm{EMV}_{k+1}$, $x^* \leftarrow x_{k+1}$
7:   end if
8:   if $\mathrm{EMV}_{\min}$ stagnates for $P$ iterations then
   […]

DD (Heckel & Hand, 2019) differs from DIP mainly in terms of the network architecture: it is typically an under-parameterized network consisting mainly of 1×1 convolutions, upsampling, ReLU and channel-wise normalization layers, while DIP uses an over-parameterized, U-net like convolutional network.

GP-DIP (Cheng et al., 2019) uses the original DIP (Ulyanov et al., 2018) network and formulation, but replaces the stochastic gradient descent (SGD) by stochastic gradient Langevin dynamics (SGLD) in the gradient update step, i.e., the generic gradient step for optimizing Eq. (2) reads:

$$\theta^{+} = \theta - t \nabla_\theta \big[\ell(y, f(G_\theta(z))) + \lambda R(G_\theta(z))\big] + \eta, \tag{37}$$

where $\eta$ is zero-mean Gaussian with an isotropic variance level $t$.

DIP-TV (Cascarano et al., 2021) uses the original DIP (Ulyanov et al., 2018) network, with a Total Variation (TV) regularizer added. The proposed objective is then solved with the Alternating Direction Method of Multipliers (ADMM) framework.

SIREN (Sitzmann et al., 2020) treats the object directly as a continuous function on $\mathbb{R}^2$ or $\mathbb{R}^3$ (or higher-dimensional spaces depending on the application) and hence parameterizes it as a multi-layer perceptron (MLP): 1) the input to SIREN is the 2D/3D coordinate of each pixel instead of random values, and 2) the network uses a sinusoidal activation function instead of the commonly used ReLU. When the DIP network is substituted with SIREN to solve Eq. (2), a similar overfitting issue is still observed.
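The Langevin update in Eq. (37) can be sketched as follows for a flat parameter vector. This is a minimal illustration of the formula as written there, with step size and noise variance both equal to `step`; it is not the GP-DIP implementation. The function name `sgld_step` and its arguments are assumptions, and the gradient is taken as given rather than computed from a network.

```python
import math
import random


def sgld_step(theta, grad, step, rng=random):
    """One update theta+ = theta - step * grad + noise, noise ~ N(0, step * I).

    theta, grad : equal-length lists of floats (parameters and loss gradient)
    step        : the scalar t in Eq. (37), used as step size and noise variance
    rng         : any object with a gauss(mu, sigma) method, e.g. random.Random(0)
    """
    std = math.sqrt(step)    # isotropic noise with variance `step`
    return [p - step * g + rng.gauss(0.0, std) for p, g in zip(theta, grad)]
```

Passing a seeded `random.Random` instance makes a run reproducible. Dropping the noise term recovers a plain gradient step, which is the only difference between the two updates in this sketch.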

A.6 MORE DETAILS ON MAJOR ES METHODS

Here, we provide more details on major competing methods, all of which are ES-based except for You et al. (2020).

Spectral Bias (SB) Shi et al. (2022) operates on DD models, and proposes two modifications to change the spectral bias: (1) controlling the operator norm of the weight $w$ of each convolutional layer by the normalization $w' = w / \max(1, \|w\|_{op}/\lambda)$, ensuring that $\|w'\|_{op} \le \lambda$, which in turn controls the Fourier spectrum of the underlying function represented by the layer; (2) performing Gaussian upsampling instead of the typical bilinear upsampling to suppress the smoothness effect of the latter. With appropriate parameter settings ($\lambda$, and $\sigma$ in Gaussian filtering), these two modifications can improve the learning of the high-frequency components by DD, and allow the blurriness-over-sharpness stopping criterion

$\Delta r(x_t) = \frac{1}{W} \left| \sum_{w=1}^{W} r(x_{t-w}) - \sum_{w=1}^{W} r(x_{t-W-w}) \right|$, (39)

where $r(x') = B(x')/S(x')$, and $B(\cdot)$ and $S(\cdot)$ are the blurriness and sharpness metrics in Crete et al. (2007) and Bahrami & Kot (2014), respectively. In other words, the criterion in Eq. (39) measures the change of the average blurriness-over-sharpness ratio over consecutive windows of size $W$, and small changes indicate good ES points. However, as noted, this criterion only works for the modified DD models and not for other DIP variants, as acknowledged by the authors in Shi et al. (2022) and confirmed in our experiment (see Sec. 3.1).
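The two ingredients of SB described above can be illustrated with a small NumPy sketch. It rests on simplifying assumptions: the weight is treated as a plain 2-D matrix (the operator norm of an actual convolutional layer is computed differently), the ratio values r(x_t) are assumed to be given (the blurriness and sharpness metrics themselves are not implemented), and the magnitude of the window difference is used. The function names are hypothetical.

```python
import numpy as np


def clip_operator_norm(w, lam):
    """Rescale a 2-D weight so that its largest singular value is at most lam."""
    op_norm = np.linalg.norm(w, ord=2)   # spectral norm of a matrix
    return w / max(1.0, op_norm / lam)


def ratio_change(r_history, W):
    """Change of the mean ratio between the two most recent windows of size W.

    r_history holds r(x_1), ..., r(x_{t-1}); returns None until 2W values exist.
    """
    r = np.asarray(r_history, dtype=np.float64)
    if r.size < 2 * W:
        return None
    recent = r[-W:].sum()            # sum over r(x_{t-w}), w = 1..W
    previous = r[-2 * W:-W].sum()    # sum over r(x_{t-W-w}), w = 1..W
    return abs(recent - previous) / W
```

A stopping rule would then compare `ratio_change` against a small threshold; how that threshold is chosen is not specified here.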

DF-STE Jo et al. (2021) targets Gaussian denoising with known noise levels (i.e., y = x + n, where n is iid Gaussian noise), and considers the objective $\min_\theta \frac{1}{n^2}\|y - G_\theta(y)\|_F^2 + \frac{\sigma^2}{n^2}\,\mathrm{tr}\, J_{G_\theta}(y)$, where $\mathrm{tr}\, J_{G_\theta}(y)$ is the trace of the network Jacobian with respect to the input, i.e., the divergence term in Jo et al. (2021). The divergence term is a proxy for controlling the capacity of the network. The paper then proposes a heuristic zero-crossing stopping criterion that stops the iteration when the loss starts to cross zero into negative values. Although the idea works reasonably well on Gaussian denoising with low and known noise levels (the variance level $\sigma^2$ is explicitly needed in the regularization parameter in front of the divergence term), it starts to break down when the noise level increases, even if the right noise level is provided; see Sec. 3.1. Also, although the paper has extended the formulation to handle Poisson noise, it is unclear how to generalize the idea to other types of noise, or how to move beyond simple additive denoising problems.

SV-ES Li et al. (2021) proposes training an autoencoder online using the reconstruction sequence $\{x_t\}_{t \ge 1}$: $\min_{w,v} \sum_{t \ge 1} \ell_{AE}(x_t, D_w \circ E_v(x_t))$. Any new $x_t$ is passed through the current autoencoder, and the reconstruction error $\ell_{AE}$ is recorded. We observe that the error curve typically follows a U-shape, and the valley of the curve is approximately aligned with the peak of the PSNR curve. We hence design an ES method by detecting the valley of the error curve. This method works reasonably well across different IPs and different DIP variants. A major drawback is the efficiency: the overhead caused by the online training of the autoencoder is an order of magnitude larger than the cost of the DIP update itself, as shown in Tab. 2.

DOP You et al. (2020) considers additive sparse noise (e.g., salt-and-pepper noise) only and proposes modeling the clean image and the noise explicitly in the objective: $\min_{\theta, g, h} \|y - G_\theta(z) - (g \circ g - h \circ h)\|_F^2$, where the overparametrized term $g \circ g - h \circ h$ ($\circ$ denotes the Hadamard product) is meant to capture the sparse noise; a similar idea has proved effective for sparse recovery in Vaskevicius et al. (2019). Different, properly tuned learning rates for the clean image and sparse noise terms are necessary for success. The downsides include the prolonged running time, as the method pushes the peak reconstruction to the very last iteration, and the difficulty of extending the idea to other types of noise.
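To make the DOP parameterization concrete, the following sketch evaluates the objective and its gradients with respect to g and h. It is an illustration only: the network output G_θ(z) is replaced by a fixed array `x_hat` (an assumption made to keep the example self-contained), and the function names are hypothetical.

```python
import numpy as np


def dop_objective(y, x_hat, g, h):
    """||y - x_hat - (g*g - h*h)||_F^2, with x_hat standing in for G_theta(z)."""
    r = y - x_hat - (g * g - h * h)   # residual; * is the Hadamard product
    return float(np.sum(r ** 2))


def dop_grads(y, x_hat, g, h):
    """Gradients of the objective with respect to g and h."""
    r = y - x_hat - (g * g - h * h)
    grad_g = -4.0 * r * g             # chain rule: 2 r * (-2 g)
    grad_h = 4.0 * r * h              # chain rule: 2 r * (+2 h)
    return grad_g, grad_h
```

Since both gradients are proportional to the current value of g or h, entries initialized near zero grow slowly unless the residual stays large there; this kind of multiplicative dynamics is commonly cited as the reason such Hadamard parameterizations favor sparse solutions. The separate learning rates mentioned in the text would be applied to `grad_g`/`grad_h` and to the network parameters independently.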

A.7 ADDITIONAL EXPERIMENTAL DETAILS & RESULTS

A.7. et al. (2018); the optimizer is ADAM with a learning rate of 0.01. For all other models, we use their default architectures, optimizers, and hyperparameters. For ES-WMV, the default window size is W = 100, and the default patience number is P = 1000. We use both PSNR and SSIM to assess the reconstruction quality, and we report PSNR and SSIM gaps (the differences between our detected numbers and the peak numbers) as an indicator of our detection performance. For most experiments, we repeat the experiments 3 times to report the mean and standard deviation; when not, we explain why.

Noise generation Following the noise generation rules of Hendrycks & Dietterich (2019)¹, we simulate four kinds of noise and three intensity levels for each noise type. The details are as follows.
• Gaussian noise: zero-mean additive Gaussian noise with variance 0.12, 0.18, and 0.26 for low, medium, and high noise levels, respectively;
• Impulse noise: also known as salt-and-pepper noise; each pixel is replaced, with probability p ∈ [0, 1], by a white or black pixel with equal chance. Low, medium, and high noise levels correspond to p = 0.3, 0.5, 0.7, respectively;
• Speckle noise: for each pixel x ∈ [0, 1], the noisy pixel is x(1 + ε), where ε is zero-mean Gaussian with a variance level of 0.20, 0.35, 0.45 for low, medium, and high noise levels, respectively;
• Shot noise: also known as Poisson noise. For each pixel x ∈ [0, 1], the noisy pixel is Poisson distributed with rate λx, where λ is 25, 12, 5 for low, medium, and high noise levels, respectively.
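The four noise models listed above can be sketched as follows. This is a minimal sketch, not the reference noise-generation code: the function name `add_noise` is hypothetical, the listed numbers are taken as variances exactly as the text states them (if they are meant as standard deviations, the square roots should be dropped), impulse noise is applied per array element, the shot-noise output is divided by λ to return to the [0, 1] scale (an assumption), and no clipping is applied.

```python
import numpy as np

LEVELS = {"low": 0, "medium": 1, "high": 2}
GAUSS_VAR = (0.12, 0.18, 0.26)     # taken as variances, as stated in the text
IMPULSE_P = (0.3, 0.5, 0.7)
SPECKLE_VAR = (0.20, 0.35, 0.45)
SHOT_RATE = (25, 12, 5)


def add_noise(x, kind, level, rng):
    """Corrupt an image x with values in [0, 1]; rng is a numpy Generator."""
    i = LEVELS[level]
    if kind == "gaussian":
        return x + rng.normal(0.0, np.sqrt(GAUSS_VAR[i]), x.shape)
    if kind == "impulse":
        out = x.copy()
        hit = rng.random(x.shape) < IMPULSE_P[i]   # elements to be replaced
        white = rng.random(x.shape) < 0.5          # half white, half black
        out[hit & white] = 1.0
        out[hit & ~white] = 0.0
        return out
    if kind == "speckle":
        eps = rng.normal(0.0, np.sqrt(SPECKLE_VAR[i]), x.shape)
        return x * (1.0 + eps)
    if kind == "shot":
        lam = SHOT_RATE[i]
        # Poisson counts with rate lam * x, rescaled back by lam (assumption).
        return rng.poisson(lam * x) / lam
    raise ValueError(f"unknown noise type: {kind}")
```

For example, `add_noise(img, "shot", "high", np.random.default_rng(0))` would produce the high-level shot-noise version of `img` under these assumptions.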

A.7.3 DENOISING EXAMPLES

On image denoising with different types and levels of noise, our ES method can help DIP to detect near-peak ES points, as shown in Fig. 11. We also explore the possibility of using the loss for ES here, but we fail to find correlations between the trend of the loss and that of the PSNR curve.

A comparison between ES-WMV and DF-STE for Gaussian and shot noise on the 9-image dataset in terms of SSIM is reported in Fig. 18. Furthermore, we also test our ES-WMV and DF-STE on CBSD68 in Tab. 5. Our ES-WMV wins in high-level noise cases, but lags behind DF-STE in the low-level cases. The gaps between our ES-WMV and DF-STE for all noise levels mostly come from the difference in peak performance between the original DIP and DF-STE: the modifications in DF-STE have affected the peak performance, positively for low-level cases and negatively for high-level cases, not

$\ell(\theta) = \|(G_\theta(z) - y) \odot m\|_F^2$.

The mask m is generated according to an iid Bernoulli model with a rate of 50%, i.e., half of the pixels are not observed in expectation. The noise ε is set to the medium level, i.e., additive Gaussian with zero mean and variance 0.18. We test our ES-WMV for DIP on the inpainting dataset used in the original DIP paper Ulyanov et al. (2018). The PSNR gaps are ≤ 1.00 and the SSIM gaps are ≤ 0.05 for most cases (see Tab. 8). We also visualize two examples in Fig. 24.

3×H×W is a downsampling operator that resizes an image by the factor t. Then, given y and t, the goal is to reconstruct $x_0$. We consider the formulation reparametrized by DIP, where $G_\theta$ is a trainable DNN parametrized by θ and z is a frozen random seed: $\ell(\theta) = \|D_t(G_\theta(z)) - y\|_F^2$. The noise ε is again set to the medium level, i.e., additive Gaussian with zero mean and variance 0.18. We test our ES-WMV for DIP on the super-resolution dataset used in the original DIP paper Ulyanov et al. (2018). The PSNR gaps are ≤ 1.00 and the SSIM gaps are ≤ 0.05 for most cases (see Tab. 9).
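The two data-fitting losses above can be written down directly. The sketch below is illustrative and makes two assumptions: `x_hat` stands in for the network output G_θ(z), and the downsampling operator D_t is replaced by simple average pooling, which is not necessarily the downsampler used in the experiments. Function names are hypothetical.

```python
import numpy as np


def inpainting_loss(x_hat, y, m):
    """||(x_hat - y) ⊙ m||_F^2: only pixels with m = 1 contribute."""
    return float(np.sum(((x_hat - y) * m) ** 2))


def downsample(x, t):
    """Average-pool a (C, H, W) array by an integer factor t (stand-in for D_t)."""
    c, h, w = x.shape
    return x.reshape(c, h // t, t, w // t, t).mean(axis=(2, 4))


def super_resolution_loss(x_hat, y, t):
    """||D_t(x_hat) - y||_F^2 with the average-pooling stand-in for D_t."""
    return float(np.sum((downsample(x_hat, t) - y) ** 2))


def bernoulli_mask(shape, rate, rng):
    """iid Bernoulli mask; each pixel is observed (m = 1) with probability rate."""
    return (rng.random(shape) < rate).astype(np.float64)
```

With `rate=0.5`, `bernoulli_mask` matches the 50% observation rate described in the text; unobserved pixels have no influence on the loss, so their values are determined entirely by the network's implicit prior.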
Our ES-WMV is again able to detect near-peak performance for most images.

2018) and the classic 9-image dataset (Dabov et al., 2008) (see Tabs. 10 and 11 and Fig. 25), due to the strong smoothing effect (we set α = 0.1). In this paper, we prefer to stay simple and leave systematic evaluations of these variants on more tasks as future work.

A.7.12 BLIND IMAGE DEBLURRING (BID)

In this section, we systematically test our ES-WMV and VAL on the entire standard Levin dataset for both low-level and high-level noise cases. We set the maximum number of iterations to 10,000 to ensure that we perform sufficient optimization. The images detected by our ES-WMV are substantially better than those of VAL, as shown in Tab. 12.

We vary the window size W (default: 100) and patience number P (default: 1000) across a range and check how the detection gap changes for Gaussian denoising with medium-level noise on the classic 9-image dataset (see Fig. 26).

We note that there are some occasional failure cases when applying our ES method to some DIP variants in Fig. 8. In this section, we provide the VAR curves of these cases. For the failure of GP-DIP on the "House (L)" image in Fig. 8, GP-DIP has an unusual multi-valley, gradually descending pattern in the VAR curve, corresponding to a multi-peak, gradually ascending pattern in the PSNR curve. The first major valley in the VAR curve is roughly aligned with the first major peak, not the final best peak, in the PSNR curve. So although our valley-detection method successfully detects the first major valley, the PSNR gap is relatively large. Overall, although our ES method works well with GP-DIP for most of the test cases, we would not recommend GP-DIP for practical use. The concern is the speed: as a method trying to mitigate overfitting, GP-DIP tends to reach its best reconstruction around the very last iterates. The failure on the "Lena (L)" image is due to a similar multi-valley pattern in the VAR curve.
For both cases, we observe that using smaller learning rates for GP-DIP and DD helps to smooth out their curves and mitigate the multi-valley phenomenon, which will likely lead to much smaller detection gaps. We hesitate to refine in this direction, as the focus of this paper is on the ES method itself.



https://github.com/hendrycks/robustness
http://www.cs.tut.fi/~foi/GCF-BM3D/index.html#ref_results



Figure 3: ES-WMV on DD, GP-DIP, DIP-TV, and SIREN for denoising "F16" with different levels of Gaussian noise (top: low-level noise; bottom: high-level noise). Red curves are PSNR curves, and blue curves are VAR curves. The green bars indicate the detected ES points. (We sketch the details of the DIP variants above in Appendix A.5.)

Partial theoretical justification We can make our heuristic argument in Sec. 2 more rigorous by restricting ourselves to additive denoising, i.e., y = x + n, and appealing to the popular linearization strategy (i.e., the neural tangent kernel; Jacot et al. (2018); Heckel & Soltanolkotabi (2020b)) in understanding DNNs. The idea is based on the assumption that during DNN training θ does not move far away from the initialization θ_0, so that the learning dynamics can be approximated by those of a linearized model, i.e., suppose that we take the MSE loss

Figure 6: Comparison of DF-STE and ES-WMV for Gaussian and shot noise in terms of PSNR.

Competing methods DF-STE (Jo et al., 2021) is specific to Gaussian and Poisson denoising, and the noise variance is needed for its tuning parameters. Fig. 6 presents the comparison with DF-STE in terms of PSNR. SSIM results are in Appendix A.7.5. Here, we directly report the final PSNRs obtained by both methods. For low-level noise, there is no clear winner. For high-level noise, ES-WMV outperforms DF-STE by considerable margins. Although the right variance level is provided to DF-STE in order to tune its regularization parameters, DF-STE stops after only a few epochs, leading to very low performance and almost zero standard deviations: it returns almost the noisy input. In contrast, we do not perform any parameter tuning for ES-WMV. We further compare the two methods on CBSD68 in Appendix A.7.5.

Figure 7: Comparison of VAL and ES-WMV for Gaussian and impulse noise in terms of PSNR.

Figure 8: Performance of ES-WMV on DD, GP-DIP, DIP-TV, and SIREN for Gaussian denoising in terms of PSNR gaps. L: low noise level; H: high noise level.

ES-WMV as a helper for implicit neural representations (INRs) INRs, such as Tancik et al. (2020) and Sitzmann et al. (2020), use multilayer perceptrons to represent high-frequency functions in low-dimensional problem domains and have achieved superior results on complex 3D visual tasks. We further extend our ES-WMV to help the INR family and take SIREN (Sitzmann et al., 2020) as an example. SIREN parameterizes x as the discretization of a continuous function: this function takes in spatial coordinates and returns the corresponding function values. Here, we test SIREN, which is reviewed in Appendix A.5, as a replacement for DIP models for Gaussian denoising, and summarize the results in Fig. 8 and Fig. 21. ES-WMV is again able to detect near-peak performance for most images.

ES-WMV on MRI reconstruction, a classical linear IP with a nontrivial forward mapping: y ≈ F(x), where F is the subsampled Fourier operator, and we use ≈ to indicate that the noise encountered in practical MRI may be hybrid (e.g., additive, shot) and uncertain. Here, we take 8-fold undersampling and parametrize x using "ConvDecoder" (Darestani & Heckel, 2021), a variant of DD. Due to the heavy overparameterization, overfitting occurs, and ES is needed. Darestani & Heckel (2021) directly sets the stopping point at the 2500-th epoch, and we run our ES-WMV. We visualize the performance on two random cases (C1: 1001339 and C2: 1000190, sampled from Darestani & Heckel (2021), part of the fastMRI dataset (Zbontar et al., 2018)) in Fig. 23 (quality measured in SSIM, consistent with Darestani & Heckel (2021)). It is clear that ES-WMV detects near-peak performance for both cases, and it is adaptive enough to yield comparable or better ES points than heuristically fixed ES points. We further test our ES-WMV on ConvDecoder for 30 cases from the fastMRI dataset (see Tab. 4), which shows the precise and stable detection of ES-WMV.
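A subsampled Fourier forward operator of the kind described above can be sketched with NumPy's FFT. This is a toy, single-coil illustration under stated assumptions: the mask simply keeps every 8th k-space column (practical MRI masks typically also keep a band of low frequencies), and the function names are hypothetical rather than taken from the ConvDecoder or fastMRI code.

```python
import numpy as np


def column_mask(shape, accel=8):
    """Toy Cartesian mask keeping every accel-th k-space column."""
    mask = np.zeros(shape)
    mask[:, ::accel] = 1.0
    return mask


def forward(x, mask):
    """Subsampled Fourier operator: orthonormal 2-D FFT followed by masking."""
    return mask * np.fft.fft2(x, norm="ortho")


def data_fit(x_hat, y, mask):
    """Squared error between the predicted and measured k-space samples."""
    return float(np.sum(np.abs(forward(x_hat, mask) - y) ** 2))
```

Because only one eighth of the k-space entries enter `data_fit`, many images fit the measurements equally well, which is why an overparameterized model eventually fits the noise and a stopping rule is needed.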

is often modeled as y = k * x + n, where k is the blur kernel, n models additive sensory noise, and * is linear convolution to model the spatial uniformity of the blur effect (Szeliski, 2022). BID is a very challenging visual IP due to the bilinearity: (k, x) → k * x. Recently, Ren et al. (2020); Wang et al. (2019); Asim et al. (2020); Tran et al. (2021) have tried to use DIP models to solve BID by modeling k and x as two separate DNNs, i.e., min_{θ_k, θ_x} ∥y −

Figure 9: Top left: ES-WMV on BID; Top right: visual results of ES-WMV; Bottom: quantitative results of ES-WMV and VAL, respectively.

First, we take 4 images and 3 kernels from the standard Levin dataset (Levin et al., 2011), resulting in 12 image-kernel combinations. The high noise level leads to substantial overfitting, as shown in Fig. 9 (top left). Nonetheless, ES-WMV can reliably detect good ES points and lead to impressive visual reconstructions (see Fig. 9 (top right)). We systematically compare VAL and our ES-WMV on this difficult nonlinear IP, as we suspect that nonlinearity can break VAL down, as discussed in Sec. 1, and subsampling the observation y for training-validation splitting may be unwise. Our results (Fig. 9 (bottom left/right)) confirm these predictions: the peak performance is much worse after 10% of the elements in y are removed for validation. In contrast, our ES-WMV returns quantitatively near-peak performance, far better than leaving the process to overfit. In Appendix A.7.12, we test both low- and high-level noise on the entire Levin dataset for completeness.

Figure 10: Effect of W and P

be a signal in the span of the first p trigonometric basis functions, and consider a noisy observation y = x + n, where the noise $n \sim \mathcal{N}(0, \xi^2/n \cdot I)$. To denoise this signal, we fit a two-layer generator network $G_C(B) = \mathrm{ReLU}(UBC)v$, where $v = [1, \dots, 1, -1, \dots, -1]/\sqrt{k}$, $B \sim_{iid} \mathcal{N}(0, 1)$, and U is an upsampling operator that implements circular convolution with a given kernel u. Denote $\sigma \doteq \|u\|_2 \left| F g(u \circledast u / \|u\|_2^2) \right|^{1/2}$, where $g(t) = (1 - \cos^{-1}(t)/\pi)\, t$ and $\circledast$ denotes circular convolution. Fix any $\varepsilon \in (0, \sigma_p/\sigma_1]$, and suppose $k \ge C_u n/\varepsilon^8$, where $C_u > 0$ is a constant depending only on u. Consider gradient descent with step size

while A.5 MORE DETAILS ON MAJOR DIP VARIANTS Deep Decoder (DD)

Figure 11: Our ES-WMV method on DIP for denoising "F16" with different noise types and levels (top: low-level noise; bottom: high-level noise). Red curves are PSNR curves, and blue curves are VAR curves. The green bars indicate the detected ES point.

Figure 12: Our ES-WMV method on DIP for denoising "F16" with different noise types and levels (top: low-level noise; bottom: high-level noise). Red curves are PSNR curves, and brown curves are loss curves.

Figure 13: Visual comparisons of NR-IQMs and ES-WMV. From top to bottom: Gaussian noise (low), Gaussian noise (high), impulse noise (low), impulse noise (high).

Figure 14: Visual comparisons of NR-IQMs and ES-WMV. From top to bottom: shot noise (low), shot noise (high), speckle noise (low), speckle noise (high).

Figure 15: High-level noise detection performance in terms of PSNR gaps. For NIMA, we report both technical quality assessment (NIMA-q) and aesthetic assessment (NIMA-a). Smaller PSNR gaps are better.

Figure 16: Low-level noise detection performance in terms of SSIM gaps. For NIMA, we report both technical quality assessment (NIMA-q) and aesthetic assessment (NIMA-a). Smaller SSIM gaps are better.

Figure 17: High-level noise detection performance in terms of SSIM gaps. For NIMA, we report both technical quality assessment (NIMA-q) and aesthetic assessment (NIMA-a). Smaller SSIM gaps are better.

Figure 18: Comparison of DF-STE and ES-WMV for Gaussian and shot noise in terms of SSIM.

Figure 19: Low- and high-level noise detection performance of SV-ES and ours in terms of PSNR gaps.

Figure 20: Comparison between ES-WMV and SB for image denoising (top: σ = 15; middle: σ = 25; bottom: σ = 50). The red and blue curves are the PSNR and the ratio metric curves. The orange and green bars indicate the ES points detected by our ES-WMV and SB, respectively.

Figure 21: Performance of ES-WMV on DD, GP-DIP, DIP-TV, and SIREN for Gaussian denoising in terms of SSIM gaps. L: low noise level; H: high noise level.

Figure 23: Detection on MRI reconstruction

Figure 24: Visual detection performance of ES-WMV on image inpainting.

Figure 25: Detected PSNR comparison between DIP with ES-WMV and DIP with ES-EMV on the classic 9-image dataset (Dabov et al., 2008).

Figure 26: Effect of patience number and window size on detection in terms of SSIM gaps

Ding et al., 2022) that hold part of the observation y out as a validation set to emulate validation-based ES in supervised learning, but they quickly become problematic for nonlinear IPs due to the significant violation of the underlying iid assumption; see Sec. 3.3. You et al. (2020) models sparse additive noise as an explicit term in their optimization objective. Jo et al. (2021) designs regularizers and ES criteria specific to Gaussian and shot noise. Ding et al. (2021) explores subgradient methods with diminishing step-size schedules for impulse noise with the ℓ1 loss, with preliminary success. These methods do not work beyond the noise types and levels they target, whereas our knowledge about the noise in a given visual IP is typically limited. (3) Early stopping (ES): Shi et al. (2022) tracks the progress based on a ratio of no-reference blurriness and sharpness, but the criterion only works for their modified DIP models, as acknowledged by the authors. Jo et al. (2021) provides a noise-specific regularizer and ES criterion, but it is unclear how to extend the methods to unknown noise types and levels. Li et al. (2021) proposes monitoring the DIP reconstruction by training a coupled autoencoder. Although its performance is similar to ours, the extra autoencoder training slows down the whole process dramatically; see Sec. 3. Yaman et al. (2021); Ding et al. (2022) emulate validation-based ES in supervised learning by splitting the elements of y into training and validation sets so that validation-based ES can be performed. But in IPs, especially nonlinear ones (e.g., blind image deblurring (BID), where y ≈ k * x and * is linear convolution), the elements of y can be far from iid, and so validation may not work well. Moreover, holding out part of the observation y can substantially reduce the peak performance; see Sec. 3.3.





Comparison between ES-WMV and SB for image denoising on the CBSD68 dataset with varying noise level σ. Higher detected PSNR and earlier detection are better, which are in red: mean and (std).



Comparison between ES-WMV and DF-STE for image denoising on the CBSD68 dataset with varying noise level σ: mean and (std). PSNR gaps below 1.0 are colored as red.



DIP with ES-WMV on real image denoising on the PolyU Dataset: mean and (std).

As stated from the beginning, ES-WMV is designed for real-world IPs, targeting unknown noise types and levels. Given the encouraging performance above, we test it on a common real-world denoising dataset, the PolyU Dataset (Xu et al., 2018), which contains 100 cropped regions of 512×512 from 40 scenes. The results are reported in Tab. 7. We do not repeat the experiments here; the means and standard deviations are obtained over the 100 images of the PolyU dataset. On average, our detection gaps are ≤ 1.64 in PSNR and ≤ 0.01 in SSIM for this dataset across various losses. The absolute PSNR and SSIM detected are surprisingly high.

Detection performance of DIP with ES-WMV for image inpainting: mean and (std). PSNR gaps below 1.00 are colored as red; SSIM gaps below 0.05 are colored as blue. (D: Detected)

Detection performance of DIP with ES-WMV for 4× image super-resolution: mean and (std). PSNR gaps below 1.00 are colored as red; SSIM gaps below 0.05 are colored as blue. (D: Detected)

A.7.11 ES-WMV VS. ES-EMV

We now consider our memory-efficient version (ES-EMV) as described in Algorithm 2, and compare it with ES-WMV, as shown in Fig. 25. Besides the memory benefit, ES-EMV runs around 100 times faster than ES-WMV, as reported in Tab. 2, and does seem to provide a consistent improvement on the detected PSNRs for image denoising tasks on NTIRE 2020 Real Image Denoising Challenge (Abdelhamed et al., 2020), PolyU dataset Xu et al. (

Detection performance comparison between DIP with ES-WMV and DIP with ES-EMV for real image denoising on 1024 images from the RGB track of NTIRE 2020 Real Image Denoising Challenge (Abdelhamed et al., 2020): mean and (std). Higher PSNR and SSIM are in red. (D: Detected)



BID detection comparison between ES-WMV and VAL on the Levin dataset for both low-level and high-level noise: mean and (std). Higher PSNR is in red and higher SSIM is in blue. (D: Detected)

