LEARNING SPARSE AND LOW-RANK PRIORS FOR IMAGE RECOVERY VIA ITERATIVE REWEIGHTED LEAST SQUARES MINIMIZATION

Abstract

We introduce a novel optimization algorithm for image recovery under learned sparse and low-rank constraints, which we parameterize as weighted extensions of the $\ell_p^p$-vector and $\mathcal{S}_p^p$ Schatten-matrix quasi-norms for $0 < p \le 1$, respectively. Our proposed algorithm generalizes the Iteratively Reweighted Least Squares (IRLS) method, used for signal recovery under $\ell_1$ and nuclear-norm constrained minimization. Further, we interpret our overall minimization approach as a recurrent network that we then employ to deal with inverse low-level computer vision problems. Thanks to the convergence guarantees that our IRLS strategy offers, we are able to train the derived reconstruction networks using a memory-efficient implicit back-propagation scheme, which does not pose any restrictions on their effective depth. To assess our networks' performance, we compare them against other existing reconstruction methods on several inverse problems, namely image deblurring, super-resolution, demosaicking and sparse recovery. Our reconstruction results are shown to be very competitive and in many cases outperform those of existing unrolled networks, whose number of parameters is orders of magnitude higher than that of our learned models. The code is available at this link.

1. INTRODUCTION

With the advent of modern imaging techniques, we are witnessing a significant rise of interest in inverse problems, which appear increasingly in a host of applications ranging from microscopy and medical imaging to digital photography, 2D & 3D computer vision, and astronomy (Bertero & Boccacci, 1998). An inverse imaging problem amounts to estimating a latent image from a set of possibly incomplete and distorted indirect measurements. In practice, such problems are typically ill-posed (Tikhonov, 1963; Vogel, 2002), which implies that the equations relating the underlying image to the measurements (the image formation model) are not adequate by themselves to uniquely characterize the solution. Therefore, in order to recover approximate solutions that are meaningful in a statistical or physical sense from the set of solutions consistent with the image formation model, it is imperative to exploit prior knowledge about certain properties of the underlying image. Among the key approaches for solving ill-posed inverse problems are variational methods (Benning & Burger, 2018), which entail the minimization of an objective function. A crucial part of such an objective function is the regularization term, whose role is to promote those solutions that best fit our prior knowledge about the latent image. Variational methods also have direct links to Bayesian methods and can be interpreted as seeking the penalized maximum likelihood or the maximum a posteriori (MAP) estimator (Figueiredo et al., 2007), with the regularizer matching the negative log-prior. Due to the great impact of the regularizer on the reconstruction quality, significant research effort has been put into the design of suitable priors. Among the overwhelming number of existing priors in the literature, sparsity and low-rank (spectral-domain sparsity) promoting priors have received considerable attention.
This is mainly due to their solid mathematical foundation and the competitive results they achieve (Bruckstein et al., 2009; Mairal et al., 2014). Nowadays, thanks to the advancements of deep learning, there is a plethora of networks dedicated to image reconstruction problems, which significantly outperform conventional approaches. Nevertheless, they are mostly specialized and applicable to a single task. Further, they are difficult to analyze and interpret, since they do not explicitly model any of the well-studied image properties successfully utilized in the past (Monga et al., 2021). In this work, we aim to harness the power of supervised learning while at the same time relying on the rich body of modeling and algorithmic ideas that have been developed in the past for dealing with inverse problems. To this end, our contributions are: (1) We introduce a generalization of the Iteratively Reweighted Least Squares (IRLS) method based on novel tight upper bounds that we derive. (2) We design a recurrent network architecture that explicitly models sparsity-promoting image priors and is applicable to a wide range of reconstruction problems. (3) We propose a memory-efficient training strategy based on implicit back-propagation that does not restrict in any way our network's effective depth.

2. IMAGE RECONSTRUCTION

Let us first focus on how one typically deals with inverse imaging problems of the form:
$$y = Ax + n, \quad (1)$$
where $x \in \mathbb{R}^{n \cdot c}$ is the multichannel latent image of $c$ channels that we seek to recover, $A: \mathbb{R}^{n \cdot c} \to \mathbb{R}^{m \cdot c'}$ is a linear operator that models the impulse response of the sensing device, $y \in \mathbb{R}^{m \cdot c'}$ is the observation vector, and $n \in \mathbb{R}^{m \cdot c'}$ is a noise vector that models all approximation errors of the forward model and measurement noise. Hereafter, we will assume that $n$ consists of i.i.d. samples drawn from a Gaussian distribution of zero mean and variance $\sigma_n^2$. Note that despite its seeming simplicity, this observation model is widely used in the literature, since it can describe a plethora of practical problems accurately enough. Specifically, by varying the form of $A$, Eq. (1) can cover many different inverse imaging problems such as denoising, deblurring, demosaicking, inpainting, super-resolution, MRI reconstruction, etc. If we further define the objective function:
$$\mathcal{J}(x) = \frac{1}{2\sigma_n^2}\left\|y - Ax\right\|_2^2 + \mathcal{R}(x), \quad (2)$$
where $\mathcal{R}: \mathbb{R}^{n \cdot c} \to \mathbb{R}_+ = \{x \in \mathbb{R} \mid x \ge 0\}$ is the regularizer (image prior), we can recover an estimate of the latent image $x^*$ as the minimizer of the optimization problem:
$$x^* = \arg\min_x \mathcal{J}(x). \quad (3)$$
Since the type of the regularizer $\mathcal{R}(x)$ can significantly affect the reconstruction quality, it is of the utmost importance to employ a proper regularizer for the reconstruction task at hand.
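To make the observation model and objective of Eqs. (1)-(3) concrete, here is a minimal numerical sketch with toy sizes and a hypothetical placeholder prior (the actual regularizers are the learned ones described later; all names here are illustrative):

```python
import numpy as np

# Toy instance of the observation model y = A x + n and the objective
# J(x) = ||y - A x||^2 / (2 sigma_n^2) + R(x).
rng = np.random.default_rng(0)
n, m = 16, 12
A = rng.standard_normal((m, n))      # linear forward operator (e.g. blur + subsampling)
x_true = rng.standard_normal(n)      # latent image (flattened)
sigma_n = 0.05
y = A @ x_true + sigma_n * rng.standard_normal(m)

def data_fidelity(x):
    return np.sum((y - A @ x) ** 2) / (2 * sigma_n ** 2)

def regularizer(x):
    # Placeholder prior for illustration; the paper parameterizes and learns R(x).
    return np.sum(np.abs(x))

def J(x):
    return data_fidelity(x) + regularizer(x)
```

The point of the sketch is only the structure of $\mathcal{J}(x)$: a quadratic data-fidelity term scaled by the noise variance plus a prior term.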

2.1. SPARSE AND LOW-RANK IMAGE PRIORS

Most of the existing image regularizers in the literature can be written in the generic form:
$$\mathcal{R}(x) = \sum_{i=1}^{\ell} \phi\left(G_i x\right),$$
where $G: \mathbb{R}^{n \cdot c} \to \mathbb{R}^{\ell \cdot d}$ is the regularization operator that transforms $x$, $G_i = M_i G$, $M_i = I_d \otimes e_i^T$ with $\otimes$ denoting the Kronecker product, and $e_i$ is the unit vector of the standard $\mathbb{R}^\ell$ basis. Further, $\phi: \mathbb{R}^d \to \mathbb{R}_+$ is a potential function that penalizes the response of the $d$-dimensional transform-domain feature vector $z_i = G_i x \in \mathbb{R}^d$. Among such regularizers, widely used are those that promote sparse and low-rank responses by utilizing as their potential functions the $\ell_1$ and nuclear norms (Rudin et al., 1992; Figueiredo et al., 2007; Lefkimmiatis et al., 2013; 2015). Enforcing sparsity of the solution in some transform domain has been studied in depth and is supported both by solid mathematical theory (Donoho, 2006; Candes & Wakin, 2008; Elad, 2010) as well as strong empirical results, which indicate that distorted images do not typically exhibit sparse or low-rank representations, as opposed to their clean counterparts. More recently, it has also been advocated that non-convex penalties such as the $\ell_p^p$ vector and $\mathcal{S}_p^p$ Schatten-matrix quasi-norms enforce sparsity better and lead to improved image reconstruction results (Chartrand, 2007; Lai et al., 2013; Candes et al., 2008; Gu et al., 2014; Liu et al., 2014; Xie et al., 2016; Kümmerle & Verdun, 2021). Based on the above, we consider two expressive parametric forms for the potential function $\phi(\cdot)$, which correspond to weighted and smooth extensions of the $\ell_p^p$ and the Schatten-matrix $\mathcal{S}_p^p$ quasi-norms with $0 < p \le 1$, respectively. The first one is a sparsity-promoting penalty, defined as:
$$\phi_{sp}(z; w, p) = \sum_{j=1}^{d} w_j \left(z_j^2 + \gamma\right)^{\frac{p}{2}}, \quad z, w \in \mathbb{R}^d, \quad (4)$$
while the second one is a low-rank (spectral-domain sparsity) promoting penalty, defined as:
$$\phi_{lr}(Z; w, p) = \sum_{j=1}^{r} w_j \left(\sigma_j^2(Z) + \gamma\right)^{\frac{p}{2}}, \quad Z \in \mathbb{R}^{m \times n},\ w \in \mathbb{R}_+^r,\ r = \min(m, n). \quad (5)$$
In both definitions, $\gamma$ is a small fixed constant that ensures the smoothness of the penalty functions. Moreover, in Eq. (5) $\sigma(Z)$ denotes the vector with the singular values of $Z$ sorted in decreasing order, while the weights $w$ are sorted in increasing order. The reason for the latter ordering is that, to better promote low-rank solutions, we need to penalize the smaller singular values of the matrix more than its larger ones. Next, we define our proposed sparse and low-rank promoting image priors as:
$$\mathcal{R}_{sp}(x) = \sum_{i=1}^{\ell} \phi_{sp}\left(z_i; w_i, p\right), \quad (6a) \qquad \mathcal{R}_{lr}(x) = \sum_{i=1}^{\ell} \phi_{lr}\left(Z_i; w_i, p\right), \quad (6b)$$
where in Eq. (6a) $z_i = G_i x \in \mathbb{R}^d$, while in Eq. (6b) $Z_i \in \mathbb{R}^{c \times q}$ is a matrix whose dependence on $x$ is expressed as $\mathrm{vec}(Z_i) = G_i x \in \mathbb{R}^d$, with $d = c \cdot q$. In words, the $j$-th row of $Z_i$ is formed by the $q$-dimensional feature vector $Z_i^{(j,:)}$ extracted from the image channel $x_j$, $j = 1, \ldots, c$. The motivation for enforcing the matrices $Z_i$ to have low rank is that the channels of natural images are typically highly correlated. Thus, it is reasonable to expect that the features stored in the rows of $Z_i$ would be dependent on each other. We note that a similar low-rank enforcing regularization strategy is typically followed by regularizers whose goal is to model the non-local similarity property of natural images (Gu et al., 2014; Xie et al., 2016). In these cases, the matrices $Z_i$ are formed in such a way that each of their rows holds the elements of a patch extracted from the image, while the entire matrix consists of groups of structurally similar patches.
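As a quick illustration, the two potential functions of Eqs. (4) and (5) can be evaluated directly (a sketch with illustrative values of $p$ and $\gamma$, not the learned parameters):

```python
import numpy as np

# Sketch of the two potential functions:
# phi_sp(z; w, p) = sum_j w_j (z_j^2 + gamma)^(p/2)         (sparsity-promoting)
# phi_lr(Z; w, p) = sum_j w_j (sigma_j(Z)^2 + gamma)^(p/2)  (low-rank promoting)
def phi_sp(z, w, p, gamma=1e-6):
    return np.sum(w * (z ** 2 + gamma) ** (p / 2))

def phi_lr(Z, w, p, gamma=1e-6):
    # Singular values come out in decreasing order; the weights w are assumed
    # increasing, so the smaller singular values are penalized more heavily.
    s = np.linalg.svd(Z, compute_uv=False)
    return np.sum(w * (s ** 2 + gamma) ** (p / 2))
```

For $p = 1$ and $\gamma \to 0$, `phi_sp` reduces to a weighted $\ell_1$ norm and `phi_lr` to a weighted nuclear norm, matching the convex special cases discussed above.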

2.2. MAJORIZATION-MINIMIZATION STRATEGY

One of the challenges in the minimization of the overall objective function in Eq. (2) is that the image priors introduced in Eq. (6) are generally non-convex w.r.t. $x$. This precludes any guarantees of reaching a global minimum, and we can only opt for a stationary point. One way to handle the minimization task would be to employ the gradient descent method. Potential problems in this case are its slow convergence, as well as the need to adjust the step size in every iteration of the algorithm so as to avoid divergence from the solution. Other possible minimization strategies are variable splitting (VS) techniques, such as the Alternating Direction Method of Multipliers (Boyd et al., 2011) and Half-Quadratic splitting (Nikolova & Ng, 2005), or the Fast Iterative Shrinkage-Thresholding Algorithm (FISTA) (Beck & Teboulle, 2009). The underlying idea of such methods is to transform the original problem into easier-to-solve sub-problems. However, VS techniques require fine-tuning of additional algorithmic parameters to ensure that a satisfactory convergence rate is achieved, while FISTA works well under the assumption that the proximal map of the regularizer (Boyd & Vandenberghe, 2004) is not hard to compute. Unfortunately, this is not the case for the regularizers considered in this work. For all the above reasons, here we pursue a majorization-minimization (MM) approach (Hunter & Lange, 2004), which does not pose such strict requirements. Under the MM approach, instead of trying to minimize the objective function $\mathcal{J}(x)$ directly, we follow an iterative procedure where each iteration consists of two steps: (a) selection of a surrogate function that serves as a tight upper bound of the original objective (majorization step), and (b) computation of a current estimate of the solution by minimizing the surrogate function (minimization step).
Specifically, the iterative algorithm for solving the minimization problem $x^* = \arg\min_x \mathcal{J}(x)$ takes the form: $x^{k+1} = \arg\min_x \mathcal{Q}(x; x^k)$, where $\mathcal{Q}(x; x^k)$ is the majorizer of the objective function $\mathcal{J}(x)$ at some point $x^k$, satisfying the two conditions:
$$\mathcal{Q}(x; x^k) \ge \mathcal{J}(x),\ \forall x, x^k \quad \text{and} \quad \mathcal{Q}(x^k; x^k) = \mathcal{J}(x^k). \quad (7)$$
Given these two properties of the majorizer, it can be easily shown that iteratively minimizing $\mathcal{Q}(x; x^k)$ also monotonically decreases the objective function $\mathcal{J}(x)$ (Hunter & Lange, 2004). In fact, to ensure this, it suffices to find an $x^{k+1}$ that decreases the value of the majorizer, i.e., $\mathcal{Q}(x^{k+1}; x^k) \le \mathcal{Q}(x^k; x^k)$. Moreover, given that both $\mathcal{Q}(x; x^k)$ and $\mathcal{J}(x)$ are bounded from below, we can safely state that upon convergence we will reach a stationary point. The success of the described iterative strategy relies solely on our ability to efficiently minimize the chosen majorizer. Noting that the data fidelity term of the objective is quadratic, we proceed by seeking a quadratic majorizer for the image prior $\mathcal{R}(x)$. This way, the overall majorizer will be of quadratic form, which is amenable to efficient minimization by solving the corresponding normal equations. Below we provide two results that allow us to design such tight quadratic upper bounds both for the sparsity-promoting (6a) and the low-rank promoting (6b) regularizers. Their proofs are provided in Appendix A.1. We note that, to the best of our knowledge, the upper bound presented in Lemma 2 is novel, and it can find use in a wider range of applications that utilize low-rank inducing penalties (Hu et al., 2021) than the ones we focus on here.

Lemma 1. Let $x, y \in \mathbb{R}^n$ and $w \in \mathbb{R}_+^n$. The weighted-$\ell_p^p$ function $\phi_{sp}(x; w, p)$ defined in Eq. (4) can be upper-bounded as:
$$\phi_{sp}(x; w, p) \le \frac{p}{2}\, x^T W_y x + \frac{p\gamma}{2}\,\mathrm{tr}(W_y) + \frac{2-p}{2}\,\phi_{sp}(y; w, p), \quad \forall\, x, y, \quad (8)$$
where $W_y = \mathrm{diag}(w)\left(I \circ \left(y y^T\right) + \gamma I\right)^{\frac{p-2}{2}}$ and $\circ$ denotes the Hadamard product. The equality in (8) is attained when $x = y$.

Lemma 2.
Let $X, Y \in \mathbb{R}^{m \times n}$ and $\sigma(Y), w \in \mathbb{R}_+^r$, with $r = \min(m, n)$. The vector $\sigma(Y)$ holds the singular values of $Y$ in decreasing order, while the elements of $w$ are sorted in increasing order. The weighted-Schatten-matrix function $\phi_{lr}(X; w, p)$ defined in Eq. (5) can be upper-bounded as:
$$\phi_{lr}(X; w, p) \le \frac{p}{2}\,\mathrm{tr}\left(W_Y X X^T\right) + \frac{p\gamma}{2}\,\mathrm{tr}(W_Y) + \frac{2-p}{2}\,\phi_{lr}(Y; w, p), \quad \forall\, X, Y, \quad (9)$$
where $W_Y = U \mathrm{diag}(w)\, U^T \left(Y Y^T + \gamma I\right)^{\frac{p-2}{2}}$ and $Y = U \mathrm{diag}(\sigma(Y))\, V^T$, with $U \in \mathbb{R}^{m \times r}$ and $V \in \mathbb{R}^{n \times r}$. The equality in (9) is attained when $X = Y$.

Next, with the help of the derived tight upper bounds, we obtain the quadratic majorizers for both of our regularizers as:
$$\mathcal{Q}_{sp}(x; x^k) = \frac{p}{2} \sum_{i=1}^{\ell} z_i^T W_{z_i^k} z_i, \quad (10a) \qquad \mathcal{Q}_{lr}(x; x^k) = \frac{p}{2} \sum_{i=1}^{\ell} \mathrm{tr}\left(Z_i^T W_{Z_i^k} Z_i\right), \quad (10b)$$
where $z_i$ and $Z_i$ are defined as in Eq. (6), $W_{z_i^k}$ and $W_{Z_i^k}$ are defined in Lemmas 1 and 2, respectively, and we have ignored all the constant terms that do not depend on $x$. In both cases, by adding the majorizer of the regularizer, $\mathcal{Q}_{reg}(x; x^k)$, to the quadratic data fidelity term, we obtain the overall majorizer $\mathcal{Q}(x; x^k)$ of the objective function $\mathcal{J}(x)$. Since this majorizer is quadratic, it is now possible to obtain the $(k+1)$-th update of the MM iteration by solving the normal equations:
$$x^{k+1} = \left(A^T A + p\,\sigma_n^2 \sum_{i=1}^{\ell} G_i^T W_i^k G_i + \alpha I\right)^{-1}\left(A^T y + \alpha x^k\right) = \left(S^k + \alpha I\right)^{-1} b^k, \quad (11)$$
where $\alpha = \delta \sigma_n^2$ and
$$W_i^k = \begin{cases} W_{z_i^k} \in \mathbb{R}^{d \times d} & \text{for } \mathcal{R}_{sp}(x), \\ I_q \otimes W_{Z_i^k} \in \mathbb{R}^{c \cdot q \times c \cdot q} & \text{for } \mathcal{R}_{lr}(x). \end{cases}$$
We note that, to ensure that the system matrix in Eq. (11) is invertible, we use an augmented majorizer that includes the additional term $\frac{\delta}{2}\left\|x - x^k\right\|_2^2$, with $\delta > 0$ being a fixed small constant (we refer to Appendix A.1.1 for a justification of the validity of this strategy). This leads to the presence of the extra term $\alpha I$ in the system matrix $S^k$ and of $\alpha x^k$ in $b^k$. Based on the above, the minimization of $\mathcal{J}(x)$, incorporating any of the two regularizers of Eq.
(6), boils down to solving a sequence of re-weighted least squares problems, where the weights $W_i^k$ of the current iteration are updated using the solution of the previous one. Given that our regularizers in Eq. (6) include the weights $w_i \ne \mathbf{1}$, our proposed algorithm generalizes the IRLS methods introduced by Daubechies et al. (2010) and Mohan & Fazel (2012), which only consider the case where $w_i = \mathbf{1}$ and have been successfully applied in the past to sparse and low-rank recovery problems, respectively.
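The resulting MM/IRLS loop can be sketched on a toy sparse-recovery instance. The sketch below uses the sparsity-promoting prior with $G = I$, a dense solve in place of CG, no augmentation term, and hand-picked illustrative sizes and constants; the monotone decrease of the objective follows from the majorization property of Lemma 1:

```python
import numpy as np

# Toy MM/IRLS loop for J(x) = ||y - A x||^2/(2 s2) + sum_i w_i (x_i^2 + gamma)^(p/2),
# i.e. the sparsity-promoting prior with G = I. Each iteration solves the normal
# equations of the quadratic majorizer of Lemma 1 (dense solve instead of CG).
rng = np.random.default_rng(1)
m, n = 20, 30
A = rng.standard_normal((m, n))
x_gt = rng.standard_normal(n) * (rng.random(n) < 0.2)   # sparse ground truth
y = A @ x_gt
s2, p, gamma, w = 0.01, 0.7, 1e-6, np.ones(n)

def J(x):
    return np.sum((y - A @ x) ** 2) / (2 * s2) + np.sum(w * (x ** 2 + gamma) ** (p / 2))

x = np.zeros(n)
vals = [J(x)]
for _ in range(30):
    Wdiag = w * (x ** 2 + gamma) ** ((p - 2) / 2)   # reweighting from the previous iterate
    S = A.T @ A / s2 + p * np.diag(Wdiag)           # system matrix of the majorizer
    x = np.linalg.solve(S, A.T @ y / s2)            # minimization step (normal equations)
    vals.append(J(x))
```

Because each update exactly minimizes a majorizer satisfying the conditions of Eq. (7), the sequence `vals` is non-increasing.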

3. LEARNING PRIORS USING IRLS RECONSTRUCTION NETWORKS

To deploy the proposed IRLS algorithm, we have to specify both the regularization operator $G$ and the parameters $w = \{w_i\}_{i=1}^{\ell}$, $p$ of the potential functions in Eqs. (4) and (5), which constitute the image regularizers of Eq. (6). Manually selecting their values, with the goal of achieving satisfactory reconstructions, can be a cumbersome task. Thus, instead of explicitly defining them, we pursue the idea of implementing IRLS as a recurrent network and learning their values in a supervised way. Under this approach, our learned IRLS-based reconstruction network (LIRLS) can be described as: $x^{k+1} = f_\theta(x^k; y)$, with $\theta = \{G, w, p\}$ denoting its parameters. The network itself consists of three main components: (a) a feature extraction layer that accepts as input the current reconstruction estimate, $x^k$, and outputs the feature maps $\{z_i^k = G_i x^k\}_{i=1}^{\ell}$; (b) the weight module that acts on $z_i^k$ and the parameters $w_i$ to construct the weight matrix $W_i^k$, which is part of the system matrix $S^k$ in Eq. (11); (c) the least-squares (LS) layer, whose role is to produce a refined reconstruction estimate, $x^{k+1}$, as the solution of Eq. (11). The overall architecture of LIRLS is shown in Fig. 1.

3.1. NETWORK TRAINING

A common training strategy for recurrent networks is to unroll the network using a fixed number of iterations and update its parameters either by means of back-propagation through time (BPTT) or by its truncated version (TBPTT) (Robinson & Fallside, 1987; Kokkinos & Lefkimmiatis, 2019). However, this strategy cannot be efficiently applied to LIRLS. The reason is that the system matrix $S^k$ in Eq. (11) is typically of huge dimensions and its direct inversion is generally infeasible. Thus, to compute $x^{k+1}$ via Eq. (11) we need to rely on a matrix-free iterative linear solver, such as the conjugate gradient (CG) method (Shewchuk, 1994). This means that, apart from the IRLS iterations, we also need to take into account the internal iterations required by the linear solver. Therefore, unrolling both types of iterations would result in a very deep network, whose efficient training would be extremely challenging for two main reasons. The first is the required amount of memory, which would be prohibitively large. The second is the problem of vanishing/exploding gradients that appears in recurrent architectures (Pascanu et al., 2013), which would be even more pronounced in the case of LIRLS due to its very deep structure. To overcome these problems, we rely on the fact that our IRLS strategy guarantees convergence to a fixed point $x^*$. Practically, this means that upon convergence of IRLS, it will hold that $x^* = f_\theta(x^*; y)$. Considering the form of our IRLS iterations, as shown in Eq. (11), this translates to:
$$g(x^*, \theta) \equiv x^* - f_\theta(x^*; y) = S^*(x^*, \theta)\, x^* - A^T y = 0, \quad (12)$$
where we explicitly indicate the dependency of $S^*$ on $x^*$ and $\theta$. To update the network's parameters during training, we have to compute the gradients $\nabla_\theta \mathcal{L}(x^*) = \nabla_\theta x^* \, \nabla_{x^*} \mathcal{L}(x^*)$, where $\mathcal{L}$ is a loss function. Now, if we differentiate both sides of Eq. (12) w.r.t. $\theta$, we get:
$$\frac{\partial g(x^*, \theta)}{\partial \theta} + \frac{\partial g(x^*, \theta)}{\partial x^*} \frac{\partial x^*}{\partial \theta} = 0 \;\Rightarrow\; \nabla_\theta x^* = -\nabla_\theta g(x^*, \theta)\left(\nabla_{x^*} g(x^*, \theta)\right)^{-1}. \quad (13)$$
Thus, we can now compute the gradient of the loss function w.r.t. the network's parameters as:
$$\nabla_\theta \mathcal{L}(x^*) = -\nabla_\theta g(x^*, \theta)\, v, \quad (14)$$
where $v$ is obtained as the solution of the linear problem $\nabla_{x^*} g(x^*, \theta)\, v = \nabla_{x^*} \mathcal{L}(x^*)$, and all the necessary auxiliary gradients can be computed via automatic differentiation. Based on the above, we can train the LIRLS network without having to restrict its overall effective depth or save any intermediate results that would significantly increase the memory requirements. The implementation details of our network, for both its forward and backward passes, are described in Algorithm 1 in Sec. A.4. Finally, while our training strategy is similar in spirit to the one used for training Deep Equilibrium Models (DEQ) (Bai et al., 2019), an important difference is that our recurrent networks are guaranteed to converge to a fixed point, while in general this is not true for DEQ models.
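The implicit differentiation of Eqs. (12)-(14) can be illustrated on a scalar toy fixed point (an assumed example map, not the LIRLS network): the implicitly computed gradient matches a finite-difference estimate without back-propagating through the iterations.

```python
import numpy as np

# Toy illustration of the implicit (fixed-point) gradient of Eq. (13): for
# g(x*, t) = 0 defining x*(t), dx*/dt = -(dg/dx*)^{-1} dg/dt. Here the
# "network" is the contractive scalar map x = f(x, t) = 0.5 * cos(t * x).
def f(x, t):
    return 0.5 * np.cos(t * x)

def fixed_point(t, iters=200):
    x = 0.0
    for _ in range(iters):
        x = f(x, t)
    return x

t = 1.3
xs = fixed_point(t)
# g(x, t) = x - f(x, t);  dg/dx = 1 + 0.5*t*sin(t*x);  dg/dt = 0.5*x*sin(t*x)
dg_dx = 1.0 + 0.5 * t * np.sin(t * xs)
dg_dt = 0.5 * xs * np.sin(t * xs)
implicit_grad = -dg_dt / dg_dx          # dx*/dt via implicit differentiation
eps = 1e-6                              # finite-difference check for comparison
fd_grad = (fixed_point(t + eps) - fixed_point(t - eps)) / (2 * eps)
```

Only the fixed point itself is needed to form the gradient, which is exactly why no intermediate iterates have to be stored during training.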

3.2. LIRLS NETWORK IMPLEMENTATION

In this section we discuss implementation details for the LIRLS variants whose performance we will assess on different grayscale/color image reconstruction tasks. As mentioned earlier, among the parameters that we aim to learn is the regularization operator $G$. For all the different networks we parameterize $G$ with a valid convolution layer that consists of 24 filters of size $5 \times 5$. In the case of color images these filters are shared across channels. Further, depending on the particular LIRLS instance that we utilize, we either fix the values of the parameters $w$, $p$ or we learn them during training. Hereafter, we will use the notations $\ell_p^{p,w}$ and $\mathcal{S}_p^{p,w}$ to refer to the networks that employ a learned sparsity-promoting prior and a learned low-rank promoting prior, respectively. The different LIRLS instances that we consider in this work are listed below:

1. $\ell_1$/$\mathcal{S}_1$ (nuclear): fixed $p = 1$, fixed $w = \mathbf{1}$, where $\mathbf{1}$ is a vector of ones.
2. Weighted $\ell_1^w$/$\mathcal{S}_1^w$ (weighted nuclear): fixed $p = 1$; the weights $w$ are computed by a weight prediction neural network (WPN). WPN accepts as inputs either the features $z^0 = \hat{G} x^0$ for the grayscale case, or their singular values for the color case, as well as the noise standard deviation $\sigma_n$. The vector $x^0$ is the estimate obtained after 5 IRLS steps of the pretrained $\ell_1$/$\mathcal{S}_1$ networks, while $\hat{G}$ is their learned regularization operator. For all studied problems except for MRI reconstruction, we use a lightweight RFDN architecture proposed by Liu et al. (2020) to predict the weights, while for MRI reconstruction we use a lightweight UNet from Ronneberger et al. (2015), which we have found to be more suitable for this task. For the $\mathcal{S}_1^w$ network, in order to enforce the predicted weights to be sorted in increasing order, we apply a cumulative sum across the channels of the output of WPN. In all cases, the number of parameters of WPN does not exceed 0.3M.
3. $\ell_p^p$/$\mathcal{S}_p^p$: learned $p \in [0.4, 0.9]$, fixed $w = \mathbf{1}$.
4.
$\ell_p^{p,w}$/$\mathcal{S}_p^{p,w}$: learned $p \in [0.4, 0.9]$; the weights $w$ are computed as described in item 2, with the only difference being that both $x^0$ and $\hat{G}$ are now obtained from the pretrained $\ell_p^p$/$\mathcal{S}_p^p$ networks.

We note that the output of the LS layer, which corresponds to the solution of the normal equations in Eq. (11), is computed by utilizing a preconditioned version of CG (PCG) (Hestenes & Stiefel, 1952). This allows for an improved convergence of the linear solver. Details about our adopted preconditioning strategy are provided in Appendix A.2. Finally, in all the reported cases, the constant $\delta$ related to the parameter $\alpha$ in Eq. (11) is set to $8 \times 10^{-4}$, while as the initial solution for the linear solver we use the output of an independently trained fast FFT-based Wiener filter.
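The cumulative-sum ordering trick mentioned in item 2 can be sketched as follows (assuming, purely for illustration, non-negative raw WPN outputs; the variable names are hypothetical):

```python
import numpy as np

# Enforcing increasing weight order for the weighted nuclear penalty: a
# cumulative sum of non-negative raw predictions is automatically sorted in
# increasing order across channels, w_1 <= w_2 <= ... <= w_r.
raw = np.random.default_rng(2).random(8)   # hypothetical non-negative WPN outputs
w = np.cumsum(raw)                         # monotonically increasing weights
```

This guarantees by construction the ordering that the low-rank penalty of Eq. (5) requires, without any explicit sorting step.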

3.3. TRAINING DETAILS

The convergence of the forward pass of LIRLS to a fixed point $x^*$ is determined according to the criterion: $\|S^*(x^*, \theta)\, x^* - A^T y\|_2 / \|A^T y\|_2 < 10^{-4}$, which needs to be satisfied for 3 consecutive IRLS steps. If this criterion is not satisfied, the forward pass is terminated after 400 IRLS steps during training and 15 steps during inference. In the LS layers we use at most 150 CG iterations during training and perform an early exit if the relative tolerance of the residual falls below $10^{-6}$, while during inference the maximum number of CG iterations is reduced to 50. When training LIRLS, in the backward pass, as shown in Eq. (14), we are required to solve a linear problem whose system matrix corresponds to the Hessian of the objective function (2). This symmetric matrix is positive definite when $\mathcal{J}(x)$ is convex and indefinite otherwise. In the former case we utilize the CG algorithm to solve the linear problem, while in the latter we use the Minimal Residual Method (MINRES) (Paige & Saunders, 1975). We perform an early exit if the relative tolerance of the residual is below $10^{-2}$ and limit the maximum number of iterations to 2000. We have experimentally found these values to be adequate for achieving stable training. All our models are trained using as loss function the negative peak signal-to-noise ratio (PSNR) between the ground truth and the network's output. We use the Adam optimizer (Kingma & Ba, 2015) with a learning rate of $5 \times 10^{-3}$ for all the models that do not involve a WPN and $1 \times 10^{-4}$ otherwise. The learning rate is decreased by a factor of 0.98 after each epoch. On average we set the batch size to 8 and train our models for 100 epochs, where each epoch consists of 500 batch passes.
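The negative-PSNR training loss described above can be written compactly as (a sketch assuming intensities normalized to $[0, 1]$, so the peak value is 1):

```python
import numpy as np

# Negative PSNR between a network output and the ground truth; minimizing this
# loss maximizes the PSNR of the reconstruction.
def neg_psnr(x_out, x_gt, peak=1.0):
    mse = np.mean((x_out - x_gt) ** 2)
    return -10.0 * np.log10(peak ** 2 / mse)
```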

4. EXPERIMENTS

In this section we assess the performance of all the LIRLS instances described in Sec. 3.2.

4.1. TRAIN AND TEST DATA

To train our models for the first three recovery tasks, we use random crops of size 64 × 64 taken from the train and val subsets of the BSD500 dataset provided by Martin et al. (2001), while for MRI reconstruction we use 320 × 320 single-coil 3T images from the NYU fastMRI knee dataset (Knoll et al., 2020). In order to train the deblurring networks, we use synthetically created blur kernels varying in size from 13 to 35 pixels, generated according to the procedure described by Boracchi & Foi (2012). For the super-resolution task we use scale factors 2, 3 and 4, with 25 × 25 kernels randomly synthesized using the algorithm provided by Bell-Kligler et al. (2019). For accelerated MRI we consider ×4 and ×8 undersampling factors in k-space, using conjugate-symmetric masks so that training and inference are performed in the real domain. For the demosaicking problem we use the RGGB CFA pattern. All our models are trained and evaluated on a range of noise levels that are typical in the literature for each considered problem. Based on this, we consider $\sigma_n$ to be up to 1% of the maximum image intensity for the deblurring, super-resolution and MRI reconstruction tasks, while for demosaicking we use a wider range of noise levels, with $\sigma_n$ being up to 3%. For our evaluation purposes we use common benchmarks proposed for each recovery task. In particular, for deblurring we use the benchmarks proposed by Sun et al. (2013) and Levin et al. (2009). For super-resolution we use the BSD100RK dataset, which consists of 100 test images from the BSD500 dataset, and the degradation model proposed by Bell-Kligler et al. (2019). For demosaicking we use the images from the McMaster (Zhang et al., 2011) dataset. For MRI reconstruction we use the 3T scans from the val subset of the NYU fastMRI knee dataset, where from each scan we take the central slice and two more slices located 8 slices apart from the central one. None of our training data intersect with the data we use for evaluation purposes.
We should further note that all benchmarks we use for evaluation contain diverse sets of images with relatively large resolutions. This practice is not common for other general-purpose methods, which typically report results on a limited set of small images, due to their need to manually fine-tune certain parameters. In our case, all network parameters are learned via training and remain fixed during inference. In order to provide a fair comparison, for the methods that do require manual tuning of parameters, we perform a grid search on a smaller subset of images to find the values that lead to the best results. We then keep these values fixed during the evaluation over the entire set of images of each benchmark. Finally, note that for all TV-based regularizers we compute their solutions using an IRLS strategy.

4.2. RESULTS

Deblurring. In Table 1 we report the average results in terms of PSNR and the structural similarity index measure (SSIM) for several of our LIRLS instances, utilizing different sparsity-promoting priors, and a few competing methods, namely anisotropic (TV-$\ell_1$) (Zach et al., 2007) and isotropic Total Variation (Rudin et al., 1992), RED (Romano et al., 2017), IRCNN (Zhang et al., 2017), and FDN (Kruse et al., 2017). From these results we observe that our LIRLS models lead to very competitive performance, despite their relatively small number of learned parameters. In fact, the $\ell_1^w$ and $\ell_p^{p,w}$ variants obtain the best results among all methods, including FDN, which is the current grayscale state-of-the-art method, and IRCNN, which involves 4.7M parameters in total (for its 25 denoising networks). Similar comparisons for color images are provided in Table 2, where we report the reconstruction results of our LIRLS variants, which utilize different learned low-rank promoting priors. For these comparisons we further consider the vector-valued TV (VTV) (Blomgren & Chan, 1998), Nuclear TV (TVN) (Lefkimmiatis et al., 2015) and DWDN (Dong et al., 2020). From these results we again observe that the LIRLS models perform very well. The most interesting observation, though, is that our $\mathcal{S}_p^p$ model, with a total of 601 learned parameters, manages to outperform the DWDN network, which involves 7M learned parameters. It is also better than $\mathcal{S}_1^w$ and very close to $\mathcal{S}_p^{p,w}$, which indicates that the norm order $p$ plays a more crucial role in this task than the weights $w$. Super-resolution. Similarly to the image deblurring problem, we provide comparisons among competing methods both for grayscale and color images. For these comparisons we still consider the RED and IRCNN methods as before, and we further include results from bicubic upsampling and the USRNet network (Zhang et al., 2020).
The latter network, unlike RED and IRCNN, is specifically designed to deal with the problem of super-resolution and involves 17M learned parameters. From the reported results we can see that for both the grayscale and color cases, the results we obtain with our LIRLS instances are very competitive, and they are only slightly inferior to those of the specialized USRNet network. For a visual assessment of the reconstruction quality we refer to Fig. 3. Demosaicking. For this recovery task we compare our low-rank promoting LIRLS models against the same general-purpose classical and deep learning approaches as in color deblurring, plus bicubic upsampling. The average results of all the competing methods are reported in Table 5. From these results we come to a similar conclusion as in the deblurring case. Specifically, the best performance is achieved by the $\mathcal{S}_p^p$ and $\mathcal{S}_p^{p,w}$ LIRLS models. The first one performs best for the lower noise levels, while the latter achieves the best results for the highest ones. Visual comparisons among the different methods are provided in Fig. 4. MRI Reconstruction. In Table 6 we compare the performance of our LIRLS models with anisotropic and isotropic TV reconstruction and the Fourier back-projection (FBP) method. While these three methods are conventional ones, they can be quite competitive on this task and are still relevant due to their interpretability, which is very important in medical applications. Based on the reported results and the visual comparisons provided in Fig. 5, we can conclude that the LIRLS models do a very good job in terms of reconstruction quality. It is also worth noting that, similarly to the other two grayscale recovery tasks, we observe that the best performance among the LIRLS models is achieved by $\ell_1^w$. This can be attributed to the fact that the learned prior of this model, while being adaptive thanks to the presence of the weights, is still convex, unlike those of $\ell_p^p$ and $\ell_p^{p,w}$.
Therefore, its output is not significantly affected by the initial solution. Additional discussions about our networks' reconstruction performance, related to the choice of $p$ and the weights $w$, are provided in Appendix A.3, along with some possible extensions that we plan to explore in the future.

A APPENDIX

A.1 PROOFS

In this section we derive the proofs for the tight upper bounds presented in Lemma 1 and Lemma 2. For our proofs we use the following inequality for the function $|x|^p$, $0 < p \le 2$, which is given by Sun et al. (2017):
$$|x|^p \le \frac{p}{2}|y|^{p-2} x^2 + \frac{2-p}{2}|y|^p, \quad \forall x \in \mathbb{R} \text{ and } y \in \mathbb{R} \setminus \{0\}, \quad (15)$$
and Ruhe's trace inequality (Marshall et al., 1979), which is stated in the following theorem:

Theorem 1. Let $A$ and $B$ be $n \times n$ positive semidefinite Hermitian matrices. Then it holds that:
$$\mathrm{tr}(AB) \ge \sum_{i=1}^{n} \sigma_i(A)\, \sigma_{n-i+1}(B), \quad (16)$$
where $\sigma_i(A)$ denotes the $i$-th singular value of $A$ and the singular values are sorted in decreasing order, that is $\sigma_i(A) \ge \sigma_{i+1}(A)$.

Proof of Lemma 1. The proof is straightforward and relies on the inequality of Eq. (15). Specifically, let us consider the positive scalars $x_i^2 + \gamma$, $y_i^2 + \gamma$, with $\gamma > 0$. If we plug them into (15) we get:
$$\left(x_i^2 + \gamma\right)^{\frac{p}{2}} \le \frac{p}{2}\left(y_i^2 + \gamma\right)^{\frac{p-2}{2}}\left(x_i^2 + \gamma\right) + \frac{2-p}{2}\left(y_i^2 + \gamma\right)^{\frac{p}{2}}. \quad (17)$$
Multiplying both sides by a non-negative scalar $w_i$ leads to:
$$w_i\left(x_i^2 + \gamma\right)^{\frac{p}{2}} \le \frac{p}{2}\, w_i\left(y_i^2 + \gamma\right)^{\frac{p-2}{2}}\left(x_i^2 + \gamma\right) + \frac{2-p}{2}\, w_i\left(y_i^2 + \gamma\right)^{\frac{p}{2}}. \quad (18)$$
The above inequality is closed under summation and, thus, it further holds that:
$$\begin{aligned}
\phi_{sp}(x; w, p) &= \sum_{i=1}^{n} w_i\left(x_i^2 + \gamma\right)^{\frac{p}{2}} \quad (19)\\
&\le \frac{p}{2}\sum_{i=1}^{n} w_i\left(y_i^2 + \gamma\right)^{\frac{p-2}{2}}\left(x_i^2 + \gamma\right) + \frac{2-p}{2}\sum_{i=1}^{n} w_i\left(y_i^2 + \gamma\right)^{\frac{p}{2}}\\
&= \frac{p}{2}\sum_{i=1}^{n} w_i\left(y_i^2 + \gamma\right)^{\frac{p-2}{2}} x_i^2 + \frac{p\gamma}{2}\sum_{i=1}^{n} w_i\left(y_i^2 + \gamma\right)^{\frac{p-2}{2}} + \frac{2-p}{2}\sum_{i=1}^{n} w_i\left(y_i^2 + \gamma\right)^{\frac{p}{2}}\\
&= \frac{p}{2}\, x^T W_y x + \frac{p\gamma}{2}\,\mathrm{tr}(W_y) + \frac{2-p}{2}\,\phi_{sp}(y; w, p), \quad \forall x, y, \quad (20)
\end{aligned}$$
with $W_y = \mathrm{diag}\left(w_1\left(y_1^2 + \gamma\right)^{\frac{p-2}{2}}, \ldots, w_n\left(y_n^2 + \gamma\right)^{\frac{p-2}{2}}\right) = \mathrm{diag}(w)\left(I \circ \left(y y^T\right) + \gamma I\right)^{\frac{p-2}{2}}$. By substitution and carrying over the algebraic operations on the r.h.s. of Eq. (20), we can show that when $x = y$ the inequality reduces to an equality. We note that it is possible to derive the IRLS algorithm that minimizes $\mathcal{J}(x)$ of Eq. (2) under the weighted $\ell_p^p$ regularizers without relying on the MM framework.
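The scalar inequality (15), on which both proofs rest, is easy to verify numerically (a sanity-check sketch with illustrative values, not part of the original proofs):

```python
import numpy as np

# Sanity check of inequality (15): |x|^p <= (p/2)|y|^(p-2) x^2 + ((2-p)/2)|y|^p
# for 0 < p <= 2 and y != 0, with equality at x = y.
rng = np.random.default_rng(3)
p = 0.7
max_gap = -np.inf   # largest observed lhs - rhs (should never be positive)
eq_err = 0.0        # worst-case deviation of the equality case x = y
for _ in range(1000):
    x, y = rng.standard_normal(), rng.standard_normal()
    lhs = abs(x) ** p
    rhs = (p / 2) * abs(y) ** (p - 2) * x ** 2 + ((2 - p) / 2) * abs(y) ** p
    max_gap = max(max_gap, lhs - rhs)
    rhs_eq = (p / 2) * abs(y) ** (p - 2) * y ** 2 + ((2 - p) / 2) * abs(y) ** p
    eq_err = max(eq_err, abs(abs(y) ** p - rhs_eq))
```

The bound holds because $t \mapsto t^{p/2}$ is concave for $0 < p \le 2$, so its first-order Taylor expansion around $y^2$ lies above it.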
In particular, we can redefine the regularizer $\phi_{sp}(\cdot)$ as:

$$\tilde{\phi}_{sp}(x; w, p) = \sum_{i=1}^{n} w_i\big(x_i^2+\gamma\big)^{\frac{p-2}{2}} x_i^2, \tag{21}$$

from which the weights $W_y$ of Lemma 1 can be inferred. Then, the convergence of the IRLS strategy to a stationary point can be proven according to Daubechies et al. (2010). Unfortunately, this strategy does not seem to apply to the weights $W_Y$ of Lemma 2. The reason is that the weighted $S_p^p$ regularizers do not act directly on the matrix $X$ but on its singular values. Specifically, it holds that:

$$\phi_{lr}(X; w, p) = \sum_{i=1}^{n} w_i\big(\sigma_i^2(X)+\gamma\big)^{\frac{p}{2}} = \operatorname{tr}\!\Big(W(X)\big(X X^T + \gamma I\big)^{\frac{p}{2}}\Big). \tag{22}$$

Unlike the previous case, here the weights $w$ are included in the matrix $W(X)$, which directly depends on $X$ as $W(X) = U(X)\operatorname{diag}(w)\,U^T(X)$, with $U(X)$ being the left singular vectors of $X$. Therefore it is unclear how the approach of Mohan & Fazel (2012) would apply in this case and how the convergence of IRLS can be established.

Proof of Lemma 2. Let us consider the positive scalars $\sigma_i^2(X)+\gamma$ and $\sigma_i^2(Y)+\gamma$. If we plug them into (15) we get:

$$\big(\sigma_i^2(X)+\gamma\big)^{\frac{p}{2}} \le \frac{p}{2}\big(\sigma_i^2(Y)+\gamma\big)^{\frac{p-2}{2}}\big(\sigma_i^2(X)+\gamma\big) + \frac{2-p}{2}\big(\sigma_i^2(Y)+\gamma\big)^{\frac{p}{2}}. \tag{23}$$

Multiplying both sides by a non-negative scalar $w_i$ leads to:

$$w_i\big(\sigma_i^2(X)+\gamma\big)^{\frac{p}{2}} \le \frac{p}{2}\, w_i\big(\sigma_i^2(Y)+\gamma\big)^{\frac{p-2}{2}}\big(\sigma_i^2(X)+\gamma\big) + \frac{2-p}{2}\, w_i\big(\sigma_i^2(Y)+\gamma\big)^{\frac{p}{2}}. \tag{24}$$

The above inequality is closed under summation and, thus, it further holds that:

$$\begin{aligned}
\phi_{lr}(X; w, p) &= \sum_{i=1}^{r} w_i\big(\sigma_i^2(X)+\gamma\big)^{\frac{p}{2}} \\
&\le \frac{p}{2}\sum_{i=1}^{r} w_i\big(\sigma_i^2(Y)+\gamma\big)^{\frac{p-2}{2}}\big(\sigma_i^2(X)+\gamma\big) + \frac{2-p}{2}\sum_{i=1}^{r} w_i\big(\sigma_i^2(Y)+\gamma\big)^{\frac{p}{2}} \\
&= \frac{p}{2}\sum_{i=1}^{r} w_i\big(\sigma_i^2(Y)+\gamma\big)^{\frac{p-2}{2}}\sigma_i^2(X) + \frac{p\gamma}{2}\sum_{i=1}^{r} w_i\big(\sigma_i^2(Y)+\gamma\big)^{\frac{p-2}{2}} + \frac{2-p}{2}\,\phi_{lr}(Y; w, p) \\
&= \frac{p}{2}\sum_{i=1}^{r} \sigma_{r-i+1}(W_Y)\,\sigma_i\big(X X^T\big) + \frac{p\gamma}{2}\operatorname{tr}(W_Y) + \frac{2-p}{2}\,\phi_{lr}(Y; w, p), \tag{25}
\end{aligned}$$

where $W_Y = U\operatorname{diag}(w)\,U^T\big(Y Y^T + \gamma I\big)^{\frac{p-2}{2}}$ and $Y$ admits the singular value decomposition $Y = U\operatorname{diag}(\sigma(Y))\,V^T$ with $U \in \mathbb{R}^{m\times r}$, $V \in \mathbb{R}^{n\times r}$, and $r = \min(m, n)$. Further, we show that it holds:

$$\begin{aligned}
W_Y &= U\operatorname{diag}(w)\,U^T\big(Y Y^T + \gamma I\big)^{\frac{p-2}{2}} \\
&= U \operatorname{diag}\!\Big(w_1\big(\sigma_1^2(Y)+\gamma\big)^{\frac{p-2}{2}}, \ldots, w_r\big(\sigma_r^2(Y)+\gamma\big)^{\frac{p-2}{2}}\Big)\, U^T \\
&= \hat{U} \operatorname{diag}\!\Big(w_r\big(\sigma_r^2(Y)+\gamma\big)^{\frac{p-2}{2}}, \ldots, w_1\big(\sigma_1^2(Y)+\gamma\big)^{\frac{p-2}{2}}\Big)\, \hat{U}^T \tag{26} \\
&= \hat{U} \operatorname{diag}\!\big(\sigma(W_Y)\big)\, \hat{U}^T \in \mathbb{R}^{m\times m},
\end{aligned}$$

where $\hat{U} = U J$, with $J$ denoting the exchange matrix (row-reversed identity matrix). We note that the vector $\sigma(W_Y) \in \mathbb{R}_+^r$, similarly to $\sigma(X)$ and $\sigma(Y)$, holds the singular values of $W_Y$ in decreasing order, given that $w_{i+1}\big(\sigma_{i+1}^2(Y)+\gamma\big)^{\frac{p-2}{2}} \ge w_i\big(\sigma_i^2(Y)+\gamma\big)^{\frac{p-2}{2}}\ \forall i$. This is true because, according to the definition of $w$, it holds that $w_{i+1} \ge w_i\ \forall i$, while it also holds that $\big(\sigma_{i+1}^2(Y)+\gamma\big)^{\frac{p-2}{2}} \ge \big(\sigma_i^2(Y)+\gamma\big)^{\frac{p-2}{2}}\ \forall i$, since $\sigma_{i+1}(Y) \le \sigma_i(Y)$ and $\frac{p-2}{2} \le 0$. Finally, given that both $W_Y$ and $X X^T$ are positive semidefinite symmetric matrices, we can invoke Ruhe's trace inequality from Theorem 1 and combine it with Eq. (25) to get:

$$\phi_{lr}(X; w, p) \le \frac{p}{2}\operatorname{tr}\!\big(W_Y\, X X^T\big) + \frac{p\gamma}{2}\operatorname{tr}(W_Y) + \frac{2-p}{2}\,\phi_{lr}(Y; w, p), \quad \forall X, Y. \tag{28}$$

By substitution and carrying out the algebraic operations on the r.h.s. of Eq. (28), we can show that when $X = Y$ the inequality reduces to an equality.
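Ruhe's trace inequality from Theorem 1 can likewise be spot-checked numerically on random positive semidefinite matrices; this is an illustrative sketch, not part of the proof:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 6
# random positive semidefinite (here real symmetric) matrices
Ra, Rb = rng.standard_normal((n, n)), rng.standard_normal((n, n))
A, B = Ra @ Ra.T, Rb @ Rb.T

# for PSD matrices the singular values equal the eigenvalues; sort decreasing
sa = np.sort(np.linalg.eigvalsh(A))[::-1]
sb = np.sort(np.linalg.eigvalsh(B))[::-1]

lhs = np.trace(A @ B)
rhs = np.sum(sa * sb[::-1])   # sum_i sigma_i(A) sigma_{n-i+1}(B)
assert lhs >= rhs - 1e-9      # tr(AB) >= sum of oppositely ordered products
```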

A.1.1 THEORETICAL JUSTIFICATION FOR USING AN AUGMENTED MAJORIZER

In Section 2.2, where we consider the solution of the normal equations in Eq. (11), instead of the majorizers $Q_{reg}$ that we derived in Eq. (10), we consider their augmented counterparts, which are of the form:

$$Q\big(x; x^k\big) = Q_{reg}\big(x; x^k\big) + \frac{\alpha}{2}\,\big\|x - x^k\big\|_2^2. \tag{29}$$

The reason is that, under this choice, the system matrix of Eq. (11) is guaranteed to be non-singular and, thus, a unique solution of the linear system always exists. To verify that this choice does not compromise the convergence guarantees of our IRLS approach, we note that $Q(x; x^k)$ is still a valid majorizer and satisfies both properties of Eq. (7) required by the MM framework. Specifically, it is straightforward to show that $Q(x; x) = Q_{reg}(x; x)$ and $Q(x; x^k) \ge Q_{reg}(x; x^k)\ \forall x, x^k$. Finally, we note that the use of the augmented majorizer serves an additional purpose. In particular, due to the term $\frac{\alpha}{2}\|x - x^k\|_2^2$, the majorizer $Q$ enforces that the estimates of two successive IRLS iterations, $x^k$ and $x^{k+1}$, do not differ significantly. Both the unique solution of the linear system and the closeness of successive IRLS estimates play an important role in the stability of the training stage of our LIRLS networks.
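The two majorizer properties can again be checked numerically. The sketch below reuses the Lemma 1 majorizer for the sparse case as a stand-in for Q_reg (an illustrative assumption) and verifies that the augmented version remains a tight upper bound:

```python
import numpy as np

rng = np.random.default_rng(2)
p, gamma, alpha = 0.5, 1e-6, 1e-2
n = 16
w = rng.uniform(0.1, 1.0, n)
x, xk = rng.standard_normal(n), rng.standard_normal(n)

def phi_sp(v):
    return np.sum(w * (v**2 + gamma) ** (p / 2))

def Q_reg(v, u):
    # quadratic majorizer of phi_sp at u (Lemma 1, sparse case)
    Wu = w * (u**2 + gamma) ** ((p - 2) / 2)
    return (p / 2) * np.sum(Wu * v**2) + (p * gamma / 2) * np.sum(Wu) \
           + ((2 - p) / 2) * phi_sp(u)

def Q_aug(v, u):
    # augmented majorizer: the proximity term keeps successive iterates close
    return Q_reg(v, u) + (alpha / 2) * np.sum((v - u) ** 2)

assert Q_aug(x, xk) >= Q_reg(x, xk)             # still upper-bounds Q_reg
assert phi_sp(x) <= Q_aug(x, xk) + 1e-12        # hence still upper-bounds phi_sp
assert abs(Q_aug(xk, xk) - phi_sp(xk)) < 1e-10  # tight at x = x^k
```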

A.2 MATRIX EQUILIBRATION PRECONDITIONING

During the training and inference of LIRLS, both the network parameters and the samples in the input batches vary significantly. This results in an inconsistent convergence behavior, which is mostly attributed to the varying convergence rate of the linear solver at each IRLS step. Indeed, it turns out that the main term $S^k$ of the system matrix defined in Eq. (11) can in certain cases be poorly conditioned. To deal with this issue and improve the overall convergence of LIRLS we apply a preconditioning strategy. In particular, we employ matrix equilibration (Duff & Koster, 2001), such that the resulting preconditioned matrix has a unit diagonal while its off-diagonal entries are not greater than 1 in magnitude. In our case all the components that form the system matrix are given in operator form, and thus we do not have access to the individual elements of $S = S^k + \alpha I$. For this reason, we describe below a practical technique for forming a diagonal preconditioner that equilibrates the matrix $S \in \mathbb{R}^{n\cdot c \times n\cdot c}$. We start by noting that this matrix can be decomposed as:

$$S = A^T A + p\,\sigma_n^2\, G^T W G + \alpha I = \begin{bmatrix} A^T & \sqrt{p}\,\sigma_n\, G^T W^{1/2} & \sqrt{\alpha}\, I \end{bmatrix} \begin{bmatrix} A \\ \sqrt{p}\,\sigma_n\, W^{1/2} G \\ \sqrt{\alpha}\, I \end{bmatrix} = B^T B, \tag{31}$$

where $B \in \mathbb{R}^{(n_1+n_2+n_3)\times n_3}$, with $n_3 = n\cdot c$. We further note the simple fact that any diagonal matrix $D = \operatorname{diag}(d)$ multiplied with $B$ from the right ($BD$) scales all the elements of each column $B_{:,j}$ by the corresponding diagonal element $d_j$, and the same diagonal matrix multiplied with $B^T$ from the left ($DB^T$) scales each row of $B^T$ in the same way. Let us now select the diagonal elements $d_j$ of $D$ to be of the form $d_j = 1/\|B_{:,j}\|_2$, i.e. the inverse of the $\ell_2$ norm of the corresponding column of $B$. In this case the product $BD$ results in a matrix with normalized columns, while the product $DB^T$ results in a matrix with normalized rows.
The product of the two, i.e. the preconditioned matrix $D B^T B D$, is then equilibrated, since it has a unit diagonal and all of its off-diagonal elements are smaller than or equal to 1 in magnitude. The task then becomes to find a simple way of calculating the vector $d$ that holds the norms of the columns of $B$. From the definition of $B$ in Eq. (31), the squared norm of its $j$-th column can be computed as:

$$\|B_{:,j}\|_2^2 = \sum_{i=1}^{n_1+n_2+n_3} B_{i,j}^2 = \sum_{i=1}^{n_1} A_{i,j}^2 + p\,\sigma_n^2 \sum_{i=1}^{n_2} \big(W^{1/2}G\big)_{i,j}^2 + \alpha. \tag{32}$$

All restoration problems under study are large-scale, meaning that the matrices $A$, $G$ and $W$ are structured but only available in operator form. Depending on the problem at hand, $A$ is either a valid convolution matrix (deblurring), a strided valid convolution matrix (super-resolution), a diagonal binary matrix (demosaicking), or an orthonormal subsampled FFT matrix (MRI reconstruction); $G$ is a block matrix with valid convolution matrices as blocks; and $W$ is either a diagonal matrix, for the case of sparsity-promoting priors, or a block-diagonal matrix with each block of dimensions $c \times c$, for the case of low-rank-promoting priors. To calculate both terms $\sum_{i=1}^{n_1} A_{i,j}^2$ and $\sum_{i=1}^{n_2} (W^{1/2}G)_{i,j}^2$ appearing in Eq. (32), we utilize the following trick: for any arbitrary matrix $C$ and the vector of ones $\mathbf{1}$, it holds that $\sum_i C_{i,j}^2 = \big((C^{\circ 2})^T \mathbf{1}\big)_j$, where $C^{\circ 2} \equiv C \circ C$ is the Hadamard square and $\circ$ denotes the Hadamard product. The computation of the Hadamard square of the operator $A$ is straightforward. For the (possibly strided) valid convolution matrices used in the deblurring and super-resolution problems (the same holds for convolutions with periodic and zero boundaries), $A^{\circ 2}$ is obtained by squaring element-wise all the elements of the convolution kernel; for the demosaicking problem we have $A^{\circ 2} = A$; and for MRI reconstruction with a subsampled orthonormal FFT matrix, we compute the squared norms of the columns directly, as each is equal to the corresponding acceleration (sampling) rate.
Since the matrix $W$ has the specific structure described above, and the convolution filter bank operator $G$ is applied independently to each color channel, it is straightforward to show that for both the sparse and the low-rank cases the following holds:

$$\sum_{i=1}^{n_2} \big(W^{1/2}G\big)_{i,j}^2 = \Big(\big((W^{1/2}G)^{\circ 2}\big)^T \mathbf{1}\Big)_j = \Big(\big(G^{\circ 2}\big)^T \big(W^{1/2}\big)^{\circ 2}\, \mathbf{1}\Big)_j. \tag{33}$$

As already discussed above, $G^{\circ 2}$ can be obtained by squaring element-wise all the elements of the convolution filter bank. The computation of $(W^{1/2})^{\circ 2}$ is trivial in the sparse case, where $W$ is a diagonal matrix for which $(W^{1/2})^{\circ 2} = W$. For the low-rank case we construct $W$ from its eigendecomposition, so all of its eigenvalues and eigenvectors are at hand, meaning that we can easily obtain $W^{1/2}$ by taking the square root of the eigenvalues of $W$ and recomposing them with its eigenvectors. Then, $(W^{1/2})^{\circ 2}$ is easily computed by squaring element-wise all the elements of $W^{1/2}$.
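The whole equilibration recipe can be sketched on a toy 1-D problem where the operators are small enough to materialize as explicit matrices; the circular-convolution A, finite-difference G, and all hyperparameter values below are illustrative stand-ins, not the paper's operators:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 32
p, sigma_n, alpha = 0.8, 0.05, 1e-2

# toy operators in explicit matrix form (the paper only has them as operators):
# A: circular convolution with a blur kernel, G: circular finite differences
k = np.array([0.25, 0.5, 0.25])
A = np.zeros((n, n)); G = np.zeros((n, n))
for i in range(n):
    for j, kv in enumerate(k):
        A[i, (i + j - 1) % n] += kv
    G[i, i], G[i, (i + 1) % n] = -1.0, 1.0
W = np.diag(rng.uniform(0.5, 2.0, n))     # sparse case: diagonal IRLS weights

# stacked matrix B with S = B^T B (Eq. 31)
B = np.vstack([A, np.sqrt(p) * sigma_n * np.sqrt(W) @ G, np.sqrt(alpha) * np.eye(n)])

# column norms via the Hadamard-square trick (Eqs. 32-33): no access to B needed
col2 = (A**2).sum(0) + p * sigma_n**2 * (G**2).T @ W.diagonal() + alpha
d = 1.0 / np.sqrt(col2)
assert np.allclose(np.sqrt(col2), np.linalg.norm(B, axis=0))

# the preconditioned matrix D S D is equilibrated
S_pre = np.diag(d) @ (B.T @ B) @ np.diag(d)
assert np.allclose(np.diag(S_pre), 1.0)          # unit diagonal
assert np.all(np.abs(S_pre) <= 1.0 + 1e-12)      # off-diagonals bounded by 1
```

The off-diagonal bound follows from the Cauchy-Schwarz inequality applied to the normalized columns of BD.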

A.3 DISCUSSION ON THE PERFORMANCE OF THE LIRLS MODELS

In Section 4.2 we reported the reconstruction results that the different LIRLS models achieve on the studied reconstruction tasks, both for grayscale/single-channel and for color images. From these results we observe that, across recovery problems, the best performance is not always achieved by the same model. In particular, for the grayscale reconstruction tasks we see that the $\ell_1^w$ and $\ell_p^{p,w}$ LIRLS models perform better on average than the $\ell_1$ and $\ell_p^p$ models. This is a strong indication that the presence of the learned weights $w$ leads to more powerful sparsity-promoting regularizers and has a positive impact on the reconstruction quality. Moreover, we also observe that the best performance among all the models is achieved by $\ell_1^w$, which may seem counterintuitive, since in theory the choice of $p < 1$ should promote sparse solutions better. A possible explanation is that the $\ell_1^w$ regularizer is convex and, thus, amenable to efficient minimization, so LIRLS will converge to the global minimum of the objective function $J(x)$ in Eq. (2). On the other hand, the $\ell_p^{p,w}$ regularizer is non-convex, which means that the stationary point reached by LIRLS can be sensitive to the initialization and might be far from the global minimum. Regarding the color reconstruction tasks and the learned low-rank-promoting regularizers that we have considered, we observe that the best performance on average is achieved by the $S_p^p$ and $S_p^{p,w}$ LIRLS models with $p < 1$. To better interpret this result, we need to keep in mind that the only convex regularizer in this family is $S_1$ (the nuclear norm). Given that the nuclear norm is not as expressive as the rest of the low-rank-promoting regularizers, it is expected that this LIRLS model will be the least performing.
Also note that the weighted nuclear norm $S_1^w$, unlike $\ell_1^w$, is non-convex and thus it does not benefit from convex optimization guarantees. Based on the above, and keeping in mind the results we report in Section 4.2, it turns out that in this case the choice of $p < 1$ plays a more important role in the reconstruction quality than the learned weights $w$. Another issue worth discussing is that in this work we have applied sparsity-promoting regularization to grayscale reconstruction tasks and low-rank-promoting regularization to color recovery tasks. We note that this is by no means a strict requirement: it is possible to seek low-rank solutions in some transform domain when dealing with grayscale images, as in Lefkimmiatis et al. (2012; 2013; 2015), or sparse solutions when dealing with color images. Due to space limitations we have not explored these cases in this work, but we plan to include related results in an extended version of this paper.

Algorithm 1: Forward and backward passes of LIRLS networks.

Inputs: $x_0$: initial solution, $y$: degraded image, $A$: degradation operator
Input parameters: $\theta = \{G, w, p\}$: network parameters; $\sigma_n^2$, $\alpha$, $\gamma$

Forward Pass
Initialize: $k = 0$;
repeat
  1. Compute the feature maps $z_i^k = G_i x^k$, $i = 1, \ldots, \ell$ ($Z_i^k = \operatorname{vec}(z_i^k)$ for the low-rank case).
  2. Compute the updated weight matrices $W_i^k$ based on the current estimate $x^k$:
     • Sparse case: $W_i^k = \operatorname{diag}(w_i)\big(I \circ (z_i^k\, z_i^{k\,T}) + \gamma I\big)^{\frac{p-2}{2}}$.
     • Low-rank case: $W_i^k = I_q \otimes \Big(U_i^k \operatorname{diag}(w_i)\, U_i^{k\,T}\big(Z_i^k\, Z_i^{k\,T} + \gamma I\big)^{\frac{p-2}{2}}\Big)$, where $Z_i^k = U_i^k \operatorname{diag}\big(\sigma(Z_i^k)\big)\, V_i^{k\,T}$.
  3. Find the updated solution $x^{k+1}$ by solving the linear system: $x^{k+1} = \big(A^T A + p\,\sigma_n^2 \sum_{i=1}^{\ell} G_i^T W_i^k G_i + \alpha I\big)^{-1}\big(A^T y + \alpha x^k\big)$.
  4. $k = k + 1$.
until the convergence criterion is satisfied;
Return $x^* = x^k$;

Backward Pass
  1. Use $x^*$ to compute $W_i^* = W_i^*(G, x^*)$ following steps 1 and 2 of the Forward Pass. Then use both to define the following auxiliary network with parameters $\theta$: $g(x^*, \theta) = \big(A^T A + p\,\sigma_n^2 \sum_{i=1}^{\ell} G_i^T W_i^*(G, x^*)\, G_i\big)\, x^* - A^T y$.
  2. Compute $v = (\nabla_{x^*} g)^{-1}\rho$ by solving the linear system $\nabla_{x^*} g \cdot v = \rho$, where $\rho = \nabla_{x^*}\mathcal{L}$ and $\mathcal{L}$ is the training loss function.
  3. Obtain the gradient $\nabla_\theta \mathcal{L}$ by computing the product $\nabla_\theta g \cdot v$.
  4. Use $\nabla_\theta \mathcal{L}$ to update the network's parameters $\theta$ or backpropagate further into their parent leafs.
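The forward pass above can be sketched for the sparse case with A = I (denoising), a single finite-difference filter G, and hand-picked hyperparameters; all of these choices are illustrative simplifications, not the trained LIRLS models:

```python
import numpy as np

def irls_denoise(y, w, p=0.8, gamma=1e-6, sigma_n=0.1, alpha=1e-3, iters=50):
    """Minimal forward pass of Algorithm 1 (sparse case, A = I, one filter G)."""
    n = y.size
    G = np.zeros((n, n))                       # circular finite differences
    for i in range(n):
        G[i, i], G[i, (i + 1) % n] = -1.0, 1.0
    x = y.copy()
    for _ in range(iters):
        z = G @ x                              # step 1: feature map z^k = G x^k
        Wk = np.diag(w * (z**2 + gamma) ** ((p - 2) / 2))   # step 2: IRLS weights
        S = np.eye(n) + p * sigma_n**2 * (G.T @ Wk @ G) + alpha * np.eye(n)
        x = np.linalg.solve(S, y + alpha * x)  # step 3: normal equations
    return x

rng = np.random.default_rng(4)
clean = np.repeat([0.0, 1.0, 0.2, 0.8], 16)    # piecewise-constant signal
y = clean + 0.1 * rng.standard_normal(clean.size)
x = irls_denoise(y, w=np.full(clean.size, 5.0))
assert np.mean((x - clean) ** 2) < np.mean((y - clean) ** 2)  # estimate improves on y
```

Because the weights blow up where the filter responses are small, flat regions are smoothed aggressively while edges, where the responses are large, are penalized only lightly.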

A.4 ALGORITHMIC IMPLEMENTATION OF LIRLS NETWORKS

In Algorithm 1 we provide the pseudo-code for the forward and backward passes of the LIRLS models, where we distinguish between the learned low-rank and sparsity promoting scenarios. The gradients in the backward pass can be easily computed using any of the existing autograd libraries.
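The implicit backward pass can be illustrated on a tiny quadratic problem where the fixed-point equation g(x*, θ) = 0 is available in closed form; the gradient obtained through the backward-pass linear system is checked against a finite difference. This is a sketch with θ a single scalar for simplicity, not the LIRLS parameterization:

```python
import numpy as np

# Implicit differentiation through the fixed point g(x*, theta) = 0 of a
# tiny quadratic problem: g(x, theta) = (I + theta * G^T G) x - y.
rng = np.random.default_rng(5)
n, theta = 8, 0.7
G = np.diff(np.eye(n), axis=0)               # finite-difference analysis operator
y, t = rng.standard_normal(n), rng.standard_normal(n)

def solve(th):
    return np.linalg.solve(np.eye(n) + th * (G.T @ G), y)

def loss(th):
    return 0.5 * np.sum((solve(th) - t) ** 2)

x_star = solve(theta)
rho = x_star - t                              # rho = grad_x L at x*
Jx = np.eye(n) + theta * (G.T @ G)            # grad_x g (symmetric here)
v = np.linalg.solve(Jx, rho)                  # backward-pass linear system
dg_dtheta = G.T @ G @ x_star                  # grad_theta g
grad_implicit = -dg_dtheta @ v                # dL/dtheta via implicit function theorem

# check against a central finite difference
eps = 1e-6
grad_fd = (loss(theta + eps) - loss(theta - eps)) / (2 * eps)
assert abs(grad_implicit - grad_fd) < 1e-6
```

Note that memory stays constant in the number of forward iterations: only the fixed point and one linear solve are needed, which is what allows training without restrictions on the effective depth.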

A.5 EMPIRICAL CONVERGENCE OF LIRLS TO A FIXED POINT

As we have explained in the manuscript, relying on Lemmas 1 and 2 we have managed to find valid quadratic majorizers for both the sparsity- and low-rank-promoting regularizers defined in Eq. (6). Given that these majorizers satisfy all the necessary conditions required by the MM framework, we can safely conclude that the proposed IRLS strategy converges to a fixed point. In this section we provide further empirical evidence that supports this theoretical justification. In particular, we have conducted several evaluations of the trained LIRLS models. In the first scenario we run the LIRLS models for 30 steps for color deblurring and simulated MRI with x4 acceleration on the corresponding datasets described in Subsection 4.1. We then calculate the mean PSNR and SSIM scores individually for each step across all images in each dataset. The resulting plots are provided in Fig. 6. As we can see, after approximately 25 iterations the PSNR and SSIM curves of all the different learned regularizers reach an equilibrium and their values no longer change. For comparison we also plot the evolution of PSNR and SSIM for standard TV-based regularizers from the literature. These results provide a strong indication that our LIRLS models indeed reach a fixed point, which is well aligned with the theory.

In addition to the previous averaged convergence results, we provide some representative examples of convergence to a fixed point for individual images and models. To this end, we have selected images from the grayscale and color super-resolution benchmarks and provide the inference results of $\ell_p^{p,w}$ and $S_p^{p,w}$ in Figs. 7 and 8, respectively. The plots in the top row of each figure depict the evolution of the relative tolerance $\mathrm{rtol} = \|x^k - x^{k-1}\|_2 / \|x^k\|_2$, the value of the objective function $J(x)$ shifted by a constant, and the PSNR score across the number of performed IRLS iterations. The middle rows show the estimated solutions at specific steps, while the bottom rows show the corresponding relative error, i.e. the difference between the current latent estimate and the one from the previous step. The corresponding PSNR values and relative errors are provided for each image. For visualization purposes the difference images are normalized by the maximum relative error. From these figures it is clearly evident that the relative error between the current estimate and the one from the previous IRLS iteration gradually decreases and approaches zero. At the same time the value of the objective function approaches a stationary point and the PSNR value saturates at the later iterations. Please note that in Fig. 8 we include results obtained using more than the 15 IRLS iterations used for the comparisons reported in Sec. 4.2. The reason is that our main purpose here is to experimentally demonstrate that our $S_p^{p,w}$ LIRLS model indeed converges to a fixed point, not to maximize the computational efficiency of the model.
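The rtol-based fixed-point monitoring described above can be sketched as follows; the contractive update below is a hypothetical stand-in for one IRLS step:

```python
import numpy as np

def run_to_fixed_point(step, x0, rtol_tol=1e-6, max_iters=500):
    """Iterate x_{k+1} = step(x_k), tracking rtol = ||x_k - x_{k-1}|| / ||x_k||."""
    x, rtols = x0, []
    for _ in range(max_iters):
        x_new = step(x)
        rtol = np.linalg.norm(x_new - x) / max(np.linalg.norm(x_new), 1e-12)
        rtols.append(rtol)
        x = x_new
        if rtol < rtol_tol:   # converged to a fixed point (up to tolerance)
            break
    return x, rtols

# hypothetical contractive update standing in for one IRLS step
A = 0.5 * np.eye(4)
b = np.ones(4)
x_fix, rtols = run_to_fixed_point(lambda x: A @ x + b, np.zeros(4))
assert rtols[-1] < 1e-6
assert np.allclose(x_fix, np.linalg.solve(np.eye(4) - A, b))  # true fixed point
```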

A.6 WEIGHT PREDICTION NETWORKS ARCHITECTURES

In the main manuscript we have specified a neural network whose role is to predict the weights w from an initial solution x_0; these weights are then used in the weighted $\ell_p^p$ and $S_p^p$ norms during the optimization stage (see Fig. 1). This weight prediction network is chosen to be either a lightweight RFDN architecture, proposed by Liu et al. (2020), for the deblurring, super-resolution and demosaicking problems, or a lightweight UNet from Ronneberger et al. (2015) for MRI reconstruction. In Tables 7 and 8 below we present the detailed per-layer structure of both networks that we have used in our reported experiments.

Table 7: Detailed per-layer structure of the RFDN (Liu et al., 2020) weight prediction network (WPN) that was used to predict the weights w of the weighted $\ell_p^p$ ($\ell_1^w$, $\ell_p^{p,w}$) and weighted $S_p^p$ ($S_1^w$, $S_p^{p,w}$) norms for the deblurring, super-resolution and demosaicking problems. Here "conv" denotes a convolution layer, "relu" a rectified linear unit (ReLU), "lrelu" a leaky ReLU with negative slope 0.05, "sigmoid" a sigmoid function, "sc" a skip-connection, "cat" a concatenation along the channel dimension, "maxpool" a max-pooling operation, "interp" a bilinear interpolation, and "mul" a pointwise multiplication. For sparsity-promoting priors the numbers of input and output channels are 25 and 24, respectively, while for low-rank-promoting priors they are 4 and 3, respectively. Blocks with repeated structures (but not shared weights) are denoted with -"-. For a more detailed architecture description we refer to the code released by the authors of Liu et al. (2020), which we have used without any modifications: https://github.com/njulj/RFDN.



Figure 1: The proposed LIRLS recurrent architecture.

We evaluate the LIRLS networks described in Section 3.2 on four different image reconstruction tasks, namely image deblurring, super-resolution, demosaicking and MRI reconstruction. In all these cases the only difference in the objective function J(x) is the form of the degradation operator A. Specifically, the operator A has one of the following forms: (a) a low-pass valid convolutional operator (deblurring), (b) the composition of a low-pass valid convolutional operator and a decimation operator (super-resolution), (c) a color filter array (CFA) operator (demosaicking), and (d) a sub-sampled Fourier operator (MRI reconstruction). The first two recovery tasks involve either grayscale or color images, demosaicking involves color images, and MRI reconstruction involves single-channel images.
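The four operator forms can be sketched as follows (1-D where possible; the boundary handling is simplified to "same" convolution rather than the valid convolutions used in the paper, and the Bayer-like CFA mask is an illustrative choice):

```python
import numpy as np

def A_deblur(x, k):
    """(a) convolution with a blur kernel k ('same' output for brevity)."""
    return np.convolve(x, k, mode="same")

def A_super_res(x, k, s=2):
    """(b) blur followed by decimation by a factor s."""
    return np.convolve(x, k, mode="same")[::s]

def A_demosaick(img):
    """(c) CFA operator: keep one color channel per pixel (Bayer-like mask)."""
    mask = np.zeros_like(img)
    mask[0::2, 0::2, 0] = 1                                          # R sites
    mask[:, :, 1][(np.indices(img.shape[:2]).sum(0) % 2) == 1] = 1   # G sites
    mask[1::2, 1::2, 2] = 1                                          # B sites
    return img * mask

def A_mri(x, keep):
    """(d) subsampled Fourier operator: keep only the sampled frequencies."""
    return np.fft.fft(x, norm="ortho")[keep]

x = np.arange(8, dtype=float)
k = np.array([0.25, 0.5, 0.25])
assert A_super_res(x, k).shape == (4,)
assert A_mri(x, np.array([0, 2, 5])).shape == (3,)
m = A_demosaick(np.ones((4, 4, 3)))
assert np.all(m.sum(axis=2) == 1.0)   # exactly one channel observed per pixel
```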

Figure 2: Visual comparisons among several methods on a real blurred color image. Image and blur kernel were taken from Pan et al. (2016). For more visual examples please refer to Appendix A.7.




Comparisons on grayscale image deblurring.

Comparisons on color image deblurring.

Comparisons on grayscale image super-resolution.

Comparisons on color image super-resolution.

Comparisons on image demosaicking.

Comparisons on MRI reconstruction.


5. CONCLUSIONS

In this work we have demonstrated that our proposed IRLS method, which covers a rich family of sparsity- and low-rank-promoting priors, can be successfully applied to a wide range of practical inverse problems. In addition, thanks to its convergence guarantees, we have managed to use it in the context of supervised learning and to efficiently train recurrent networks that involve only a small number of parameters but can still lead to very competitive reconstructions. Given that most of the studied image priors are non-convex, an interesting open research topic is to analyze how the initial solution affects the output of the LIRLS models and how we can exploit this knowledge to further improve their reconstruction performance. Finally, it would be interesting to explore whether learned priors with spatially adaptive norm orders p can lead to additional improvements.


Table 8: Detailed per-layer structure of the UNet (Ronneberger et al., 2015) weight prediction network (WPN) that was used to predict the weights w of the weighted $\ell_p^p$ norm ($\ell_p^{p,w}$) for the MRI reconstruction problem. Here "conv" denotes a convolution layer, "up-conv" a transpose convolution layer, "relu" a rectified linear unit (ReLU), "norm" an instance normalization layer (Ulyanov et al., 2016), "sc" a skip-connection, "cat" a concatenation along the channel dimension, and "maxpool" a max-pooling operation.


A.7 ADDITIONAL VISUAL COMPARISONS

In this section we provide additional visual comparisons among the competing methods for all the inverse problems under study. For each reconstructed image its PSNR value is provided in dB.

