EARLY STOPPING FOR DEEP IMAGE PRIOR

Abstract

Deep image prior (DIP) and its variants have shown remarkable potential for solving inverse problems in computational imaging (CI), needing no separate training data. Practical DIP models are often substantially overparameterized. During the learning process, these models first learn the desired visual content and then pick up the potential modeling and observational noise, i.e., they overfit. Thus, the practicality of DIP hinges on early stopping (ES) that can capture the transition period. In this regard, most previous DIP works for CI tasks only demonstrate the potential of the models: they report the peak performance against the groundtruth, but provide no clue about how to operationally obtain near-peak performance without access to the groundtruth. In this paper, we set out to break this practicality barrier of DIP and propose an efficient ES strategy that consistently detects near-peak performance across several CI tasks and DIP variants. Based simply on the running variance of DIP intermediate reconstructions, our ES method not only outpaces existing ones, which work only in very narrow regimes, but also remains effective when combined with methods that try to mitigate overfitting.
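The running-variance criterion described above can be made concrete with a short sketch. This is an illustrative reading only, not the paper's exact algorithm: the class name, window size, patience rule, and the choice of mean per-pixel variance are all our own assumptions.

```python
import numpy as np

class RunningVarianceES:
    """Illustrative early-stopping detector (not the paper's exact method):
    track the variance of the last `window` intermediate reconstructions and
    flag a stop once that windowed variance has not improved for `patience`
    consecutive iterations."""

    def __init__(self, window=10, patience=15):
        self.window = window
        self.patience = patience
        self.buffer = []          # last `window` reconstructions
        self.best_var = np.inf    # smallest windowed variance seen so far
        self.best_step = None     # iteration at which it occurred
        self.stall = 0            # iterations since the variance last improved
        self.step = 0

    def update(self, recon):
        """Feed one intermediate reconstruction; return True when it is
        time to stop and keep the reconstruction from `best_step`."""
        self.step += 1
        self.buffer.append(np.asarray(recon, dtype=float).ravel())
        if len(self.buffer) > self.window:
            self.buffer.pop(0)
        if len(self.buffer) < self.window:
            return False                      # window not yet full
        stack = np.stack(self.buffer)         # shape: (window, n_pixels)
        wvar = stack.var(axis=0).mean()       # mean per-pixel variance
        if wvar < self.best_var:
            self.best_var, self.best_step, self.stall = wvar, self.step, 0
        else:
            self.stall += 1
        return self.stall >= self.patience
```

In a DIP loop one would call `update(G_θ(z))` once per iteration, stop when it returns True, and output the reconstruction saved at `best_step`.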

1. INTRODUCTION

Inverse problems (IPs) are prevalent in computational imaging (CI), ranging from basic image denoising, super-resolution, and deblurring, to advanced 3D reconstruction and major tasks in scientific and medical imaging (Szeliski, 2022). Despite the disparate settings, all these problems take the form of recovering a visual object x from y = f(x), where f models the forward process that produces the observation y. Typically, these visual IPs are underdetermined: x cannot be uniquely determined from y. This is exacerbated by potential modeling noise (e.g., a linear f approximating a nonlinear process) and observational noise (e.g., Gaussian or shot noise), so that in fact y ≈ f(x). To overcome the nonuniqueness and improve noise stability, people often encode a variety of problem-specific priors on x when formulating IPs. Traditionally, IPs are phrased as regularized data-fitting problems:

    min_x ℓ(y, f(x)) + λ R(x),    (1)

where ℓ(y, f(x)) is the data-fitting loss, R(x) is the regularizer, and λ is the regularization parameter. Here, the loss ℓ is often chosen according to the noise model, and the regularizer R encodes priors on x.

Deep image prior (DIP; Ulyanov et al., 2018) proposes parameterizing x as x = G_θ(z), where G_θ is a deep neural network (DNN) with parameters θ and z is a fixed random input, leading to

    min_θ ℓ(y, f(G_θ(z))) + λ R(G_θ(z)).    (2)

G_θ is often "overparameterized", containing substantially more parameters than the size of x, and "structured", e.g., consisting of convolutional networks that encode structural priors in natural visual objects. The resulting optimization problem is solved via standard first-order methods for modern deep learning (e.g., (adaptive) gradient descent). When x has multiple components with different physical meanings, one can naturally parametrize x using multiple DNNs. This simple idea has led to surprisingly competitive results on numerous visual IPs, from low-level image denoising, super-resolution, inpainting (Ulyanov et al., 2018; Heckel & Hand, 2019; Liu et al., 2019), and blind deconvolution (Ren et al., 2020; Wang et al., 2019; Asim et al., 2020; Tran et al., 2021; Zhuang et al., 2022a), to mid-level image decomposition and fusion (Gandelsman et al., 2019; Ma et al., 2021), and to advanced CI problems (Darestani & Heckel, 2021; Hand et al., 2018; Williams et al., 2019; Yoo et al., 2021; Baguer et al., 2020; Cascarano et al., 2021; Hashimoto & Ote, 2021; Gong et al., 2022; Veen et al., 2018; Tayal et al., 2021; Zhuang et al., 2022b); see the survey by Qayyum et al. (2021).

Overfitting issue in DIP. A critical detail that we have glossed over is overfitting. Since G_θ is substantially overparameterized, G_θ(z) can represent arbitrary elements in the x domain. Global optimization of (2) would normally lead to y = f(G_θ(z)), but G_θ(z) may not reproduce x, e.g., when f is non-injective, or when y ≈ f(x) so that G_θ(z) also accounts for the modeling and observational noise.

Fortunately, DIP models and first-order optimization methods together offer a blessing: in practice, G_θ(z) has a bias toward the desired visual content and learns it much faster than it learns noise. So the reconstruction quality climbs to a peak before potentially degrading due to noise; see Fig. 1. This "early-learning-then-overfitting" (ELTO) phenomenon has been repeatedly reported in prior works and is also backed by theories on simple G_θ and linear f (Heckel & Soltanolkotabi, 2020b;a). The successes of DIP models claimed above are mostly conditioned on the assumption that appropriate early stopping (ES) around the performance peaks can be made.

Figure 1: The "early-learning-then-overfitting" (ELTO) phenomenon in DIP for image denoising. The quality of the estimated image first climbs to a peak and then plunges once the noise is also picked up by the model G_θ(z).

There are three main approaches to countering overfitting in working with DIP models. (1) Regularization: Heckel & Hand (2019) mitigate overfitting by restricting the size of G_θ to the underparameterized regime. Metzler et al. (2018); Shi et al. (2022); Jo et al. (2021); Cheng et al. (2019) control the network capacity by regularizing the norms of layerwise weights or the network Jacobian. Liu et al. (2019); Mataev et al. (2019); Sun (2020); Cascarano et al. (2021) use additional regularizer(s) R(G_θ(z)), such as the total-variation norm or trained denoisers. However, in general, it is difficult to choose the right regularization level to preserve the peak performance while avoiding overfitting, and the optimal λ likely depends on the noise type and level; as shown in Sec. 3.1, the default λ's for selected methods in this category lead to performance much worse than ours.

Is ES for DIP trivial? Natural ideas for performing good ES can fail quickly. (1) Visual inspection: This subjective approach is fine for small-scale tasks involving few problem instances, but quickly becomes infeasible in many scenarios, such as (a) large-scale batch processing, (b) recovery of visual content that is tricky to visualize and/or examine by eye (e.g., 3D or 4D visual objects), and (c) scientific imaging of unfamiliar objects (e.g., MRI imaging of rare tumors, and microscopic imaging of new virus species). (2) Tracking full-reference/no-reference image quality metrics (FR/NR-IQMs): Without the groundtruth x, computing any FR-IQM and hence tracking its trajectory (e.g., the PSNR curve in Fig. 1) is out of the question. We consider tracking NR-IQMs as a family of baseline methods in Sec. 3. (3) Tuning the iteration number: This ad-hoc solution is taken by most previous works. But since the peak iterations of DIP vary considerably across images and tasks, this might entail numerous trial-and-error steps and lead to suboptimal stopping points. (4) Validation-based ES: ES easily reminds us of validation-based ES in supervised learning. But the DIP approach to IPs, as summarized in Eq. (2), is not supervised learning: it only deals with a single instance y, without separate (x, y) pairs as training data. There are recent ideas (Yaman et al., 2021; Ding et al., 2022) that hold part of the observation y out as a validation set to emulate validation-based ES in supervised learning, but they quickly become problematic for nonlinear IPs due to the significant violation of the underlying iid assumption; see Sec. 3.3.
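The ELTO curve that motivates all of the above can be reproduced even in a tiny, network-free setting. The sketch below is our own illustration, not an experiment from this paper: it runs plain gradient descent on an ill-conditioned linear problem y = A x* + noise, whose well-known semi-convergence is the linear analogue of ELTO studied by Heckel & Soltanolkotabi for simple G_θ and linear f. All constants are arbitrary choices for the demonstration.

```python
import numpy as np

def gd_recovery_errors(n_steps=60000, lr=1.0):
    """Gradient descent on 0.5*||y - A w||^2 with an ill-conditioned
    diagonal A and fixed noise; returns ||w_t - x*|| at every iteration."""
    a = np.logspace(0, -2, 8)                 # singular values, 1 down to 0.01
    x_true = np.ones(8)                       # the "visual content" to recover
    noise = 0.03 * np.array([1.0, -1.0] * 4)  # fixed deterministic "noise"
    y = a * x_true + noise                    # observation (diagonal A)
    w = np.zeros(8)
    errs = []
    for _ in range(n_steps):
        w -= lr * a * (a * w - y)             # gradient step on the data fit
        errs.append(np.linalg.norm(w - x_true))
    return np.array(errs)

# Signal-dominated components (large a_i) are fitted early, so the recovery
# error first drops; noise-amplifying components (small a_i) are fitted much
# later, so the error eventually climbs again: a peak-then-plunge curve.
errs = gd_recovery_errors()
```

The iteration `errs.argmin()` plays the role of the unknown peak-performance iteration that an ES method must detect without access to `x_true`.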

