Variational Learning ISTA

Abstract

Compressed sensing combines the power of convex optimization techniques with a sparsity-inducing prior on the signal space to solve an underdetermined system of equations. For many problems, the sparsifying dictionary is not directly given, nor can its existence be assumed. Besides, the sensing matrix can change across different scenarios. Addressing these issues requires solving a sparse representation learning problem, namely dictionary learning, taking into account the epistemic uncertainty on the learned dictionaries and, finally, jointly learning sparse representations and reconstructions under varying sensing matrix conditions. We propose a variant of the LISTA architecture that incorporates the sensing matrix into the architecture. In particular, we propose to learn a distribution over dictionaries via a variational approach, dubbed Variational Learning ISTA (VLISTA), which approximates a posterior distribution over the dictionaries as part of an unfolded LISTA-based recovery network. Such a variational posterior distribution is updated after each iteration, and thereby adapts the dictionary according to the optimization dynamics. As a result, VLISTA provides a probabilistic way to jointly learn the dictionary distribution and the reconstruction algorithm with varying sensing matrices. We provide theoretical and experimental support for our architecture and show that it learns calibrated uncertainties.

1. Introduction

Compressed sensing methods aim at solving under-determined inverse problems by imposing a prior on the signal structure. Sparsity and linear inverse problems are canonical examples of the signal structure and of the sensing medium (modelled with a linear transformation Φ), respectively. Many works in recent years have focused on improving the performance and complexity of compressed sensing solvers for a given dataset. A typical approach is based on unfolding iterative algorithms as layers of neural networks and learning the parameters end-to-end, starting from the Learned Iterative Soft-Thresholding Algorithm (LISTA) Gregor & LeCun (2010), with many follow-up works. Varying sensing matrices and unknown sparsifying dictionaries are some of the main challenges of data-driven approaches. The works in Aberdam et al. (2021); Schnoor et al. (2022) address these issues by learning a dictionary and including it in the optimization iteration. However, the data samples might not have any exact sparse representation, which means that there is no ground truth dictionary. The issue can be more severe for heterogeneous datasets, where the choice of the dictionary might vary from one sample to another. A principled approach to this problem is to take a Bayesian view and define a distribution over the learned dictionaries with proper uncertainty quantification. In this work, first, we formulate an augmented LISTA-like model, termed Augmented Dictionary Learning ISTA (A-DLISTA), that can adapt its parameters to the current data instance. We theoretically motivate such a design and empirically show that it can outperform other LISTA-like models in a non-static measurement scenario, i.e., considering varying sensing matrices across data samples. We are aware that an augmented version of LISTA, named Neurally Augmented ALISTA (NALISTA), was already proposed in Behrens et al. (2021); however, there are some fundamental differences between NALISTA and A-DLISTA. 
First, our model takes as input the per-sample sensing matrix and the dictionary at the current layer. This means that A-DLISTA adapts the parameters to the current measurement setup as well as to the learned dictionaries. In contrast, NALISTA assumes a fixed sensing matrix in order to analytically evaluate its weight matrix, W. Hypothetically, NALISTA could handle varying sensing matrices; however, that comes at the price of solving, for each data sample, the inner optimization step that evaluates the W matrix. Moreover, the architectures of the augmentation networks are profoundly different. Indeed, while NALISTA uses an LSTM, A-DLISTA employs a convolutional neural network shared across all layers. Such a different choice reflects the different types of dependencies between layers and input data that the networks try to model. We report in subsection 3.3 a detailed discussion about the theoretical motivation and architectural design for A-DLISTA. Moreover, the detailed architecture is described in Appendix A. Finally, we introduce Variational Learning ISTA (VLISTA), where we learn a distribution over dictionaries and update it after each iteration based on the outcome of the previous layer. In this sense, our model learns an adaptive iterative optimization algorithm where the dictionary is iteratively refined for the best performance. Besides, the uncertainty estimation provides an indicator for detecting Out-Of-Distribution (OOD) samples. Intuitively, our model can be understood as a form of recurrent variational autoencoder, e.g., Chung et al. (2015), where on each iteration of the optimization algorithm we have an approximate posterior distribution over the dictionaries, conditioned on the outcome of the last iteration. The main contributions of our work are as follows. 
• We design an augmented version of LISTA, dubbed A-DLISTA, that can handle non-static measurement setups, i.e., per-sample sensing matrices, and that can adapt parameters to the current data instance.
• We propose Variational Learning ISTA (VLISTA), which learns a distribution over sparsifying dictionaries. The model can be interpreted as a Bayesian LISTA model that leverages A-DLISTA as the likelihood model.
• VLISTA adapts the dictionary to the optimization dynamics and can therefore be interpreted as a hierarchical representation learning approach, where the dictionary atoms gradually permit more refined signal recovery.
• The dictionary distributions can be used for out-of-distribution detection.

The remaining part of the paper is organized as follows. In section 2 we briefly report related works that are relevant to the current research, while in section 3 the model formulation is detailed. The dataset descriptions, as well as the experimental results, are reported in section 4. Finally, we report our conclusion in section 5.

2. Related Works

The compressed sensing field abounds with works on the theoretical and numerical analysis of recovery algorithms (see Foucart & Rauhut (2013)), with iterative algorithms as one of the central approaches, such as the Iterative Soft-Thresholding Algorithm (ISTA) Daubechies et al. (2004), Approximate Message Passing (AMP) Donoho et al. (2009), Orthogonal Matching Pursuit (OMP) Pati et al. (1993); Davis et al. (1994), and the Iterative Hard-Thresholding Algorithm (IHTA) Blumensath & Davies (2009). These algorithms are characterized by a specific set of hyperparameters, e.g., the number of iterations and the soft thresholds, that can be tuned to obtain a better trade-off between performance and complexity. By unfolding iterative algorithms as layers of neural networks, these parameters can be learned in an end-to-end fashion from a dataset; see for instance the variants in Zhang & Ghanem (2018); Metzler et al. (2017); Yang et al. (2016); Borgerding et al. (2017); Sprechmann et al. (2015).

Bayesian Compressed Sensing (BCS) and Dictionary learning. A non-parametric Bayesian approach to dictionary learning has been introduced in Zhou et al. (2009, 2012), where the authors consider a fully Bayesian joint compressed sensing inversion and dictionary learning. Besides, their atoms are drawn and fixed a priori. Bayesian compressed sensing Ji et al. (2008) leverages relevance vector machines (RVMs) Tipping (2001) and uses a hierarchical prior to model the distribution of each entry. This line of work quantifies the uncertainty of recovered entries while assuming a fixed dictionary. In contrast, in our work, the source of uncertainty is the unknown dictionary, over which we define a distribution.

LISTA models. [...] (2014). When there are data-sample specific dictionaries in our proposed model, it is reminiscent of extensions of VAEs to the recurrent setting Chung et al. (2015, 2016), which assume a sequential structure in the data and impose temporal correlations between the latent variables. There are also connections and similarities to Markov state-space models, such as the ones described in Krishnan et al. (2017).

Bayesian Deep Learning. When we employ global dictionaries in VLISTA, the model essentially becomes a variational Bayesian Recurrent Neural Network. Variational Bayesian neural networks have been introduced in Blundell et al. (2015), with independent priors and variational posteriors for each layer. This work has been further extended to recurrent settings in Fortunato et al. (2019). The main difference between these works and our setting lies in the prior and variational posterior: in our case, the prior and variational posterior for each step are conditioned on previous steps, instead of being fixed across steps.

3. Variational Learning ISTA

In this section, we briefly report on the ISTA and LISTA models to solve linear inverse problems. Then, we introduce our first model, A-DLISTA, capable of learning the sparsifying dictionary and adapting to different sensing matrices. Finally, we focus on the VLISTA model, a variational framework for solving linear inverse problems that leverages A-DLISTA as the likelihood model and achieves high power to reject OOD samples.

3.1. Linear inverse problems

We consider the following linear inverse problem: $y = \Phi x$. The matrix $\Phi$ is called the sensing matrix. If the vector $x$ is sparse in a dictionary basis $\Psi$, the problem can be cast as a sparse recovery problem $y = \Phi\Psi z$, with $z$ a sparse vector. A proximal gradient descent-based approach to this problem yields the ISTA iterations:

$$z_t = \eta_{\theta_t}\left(z_{t-1} + \gamma_t (\Phi\Psi)^H (y - \Phi\Psi z_{t-1})\right), \tag{1}$$

where $\theta_t, \gamma_t > 0$ are hyper-parameters of the model, meaning that the algorithm does not possess any trainable parameters. Generally speaking, $\gamma_t$ is called the step size, and its value is typically set to the inverse of the squared spectral norm of the matrix $A = \Phi\Psi$, i.e., the inverse of the largest eigenvalue of $A^H A$. The hyper-parameter $\theta_t$ is termed the threshold, and it is the value characterizing the so-called soft-threshold function, given by $\eta_\theta(x) = \mathrm{sign}(x)(|x| - \theta)_+$. In the ISTA formulation, those two parameters are shared across all the iterations. Therefore, we have $\gamma_t, \theta_t \to \gamma, \theta$.
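The ISTA iteration described above can be illustrated with a minimal NumPy sketch. The function names, the toy problem sizes, and the threshold value are illustrative assumptions, not part of the original work; the step size is set to the inverse of the squared spectral norm of $A = \Phi\Psi$, a standard choice for which the underlying objective does not increase.

```python
import numpy as np


def soft_threshold(x, theta):
    """Soft-thresholding: sign(x) * max(|x| - theta, 0), applied entrywise."""
    return np.sign(x) * np.maximum(np.abs(x) - theta, 0.0)


def ista(y, Phi, Psi, theta, n_iter=1000):
    """Iterate z_t = eta_theta(z_{t-1} + gamma * A^H (y - A z_{t-1})), A = Phi Psi."""
    A = Phi @ Psi
    # Assumption: gamma is the inverse of the squared spectral norm of A.
    gamma = 1.0 / np.linalg.norm(A, 2) ** 2
    z = np.zeros(A.shape[1])
    for _ in range(n_iter):
        z = soft_threshold(z + gamma * (A.conj().T @ (y - A @ z)), theta)
    return z


# Toy problem (hypothetical sizes): 20 measurements, 50 atoms, 3 active entries.
rng = np.random.default_rng(0)
Phi = rng.standard_normal((20, 50))   # sensing matrix
Psi = np.eye(50)                      # canonical dictionary
z_true = np.zeros(50)
z_true[[3, 17, 41]] = [1.5, -2.0, 1.0]
y = Phi @ Psi @ z_true
z_hat = ista(y, Phi, Psi, theta=1e-3, n_iter=2000)
```

Because both the threshold and the step size are fixed by hand here, nothing is learned; this is the baseline that the unfolded models in the following sections parametrize.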

3.2. LISTA

LISTA Gregor & LeCun (2010) is a reparametrized, unfolded version of the ISTA algorithm in which each iteration, or layer, is characterized by learnable matrices. Specifically, LISTA reinterprets Equation 1 as defining the layer of a feed-forward neural network implemented as $\eta_{\theta_t}(V_t z_{t-1} + W_t y)$, where $V_t, W_t$ are learnt from a dataset. In that way, those weights implicitly contain information about $\Phi$ and $\Psi$. However, in many problems, the dictionary $\Psi$ is not given, and the sensing matrix $\Phi$ can change for each sample in the dataset. Like LISTA, its variations, e.g., Analytic LISTA (ALISTA) Liu et al. (2019), NALISTA Behrens et al. (2021), and HyperLISTA Chen et al. (2021), rely on similar constraints, such as a fixed dictionary and sensing matrix. As a consequence, those algorithms fail in situations where either $\Phi$ is not fixed or $\Psi$ is not known.
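A small sketch may help to see how the LISTA layer relates to Equation 1. With the particular weights $V = I - \gamma A^\top A$ and $W = \gamma A^\top$ (for a real-valued $A = \Phi\Psi$), the unrolled network reproduces ISTA exactly; training then moves the weights away from this starting point. The function names and the initialisation helper below are hypothetical, and no training loop is shown.

```python
import numpy as np


def soft_threshold(x, theta):
    """Soft-thresholding: sign(x) * max(|x| - theta, 0), applied entrywise."""
    return np.sign(x) * np.maximum(np.abs(x) - theta, 0.0)


def lista_forward(y, V, W, thetas):
    """Unrolled forward pass: z_t = eta_{theta_t}(V_t z_{t-1} + W_t y), z_0 = 0."""
    z = np.zeros(V[0].shape[0])
    for V_t, W_t, theta_t in zip(V, W, thetas):
        z = soft_threshold(V_t @ z + W_t @ y, theta_t)
    return z


def ista_initialisation(A, n_layers, theta):
    """Weights for which LISTA coincides with ISTA (illustrative helper)."""
    gamma = 1.0 / np.linalg.norm(A, 2) ** 2
    V = [np.eye(A.shape[1]) - gamma * A.T @ A] * n_layers
    W = [gamma * A.T] * n_layers
    return V, W, [theta] * n_layers


rng = np.random.default_rng(0)
A = rng.standard_normal((10, 30))     # plays the role of Phi @ Psi
y = rng.standard_normal(10)
V, W, thetas = ista_initialisation(A, n_layers=3, theta=0.01)
z_lista = lista_forward(y, V, W, thetas)
```

The sketch also makes the limitation visible: $V_t$ and $W_t$ absorb one specific $A$, so a different sensing matrix per sample cannot be accommodated without changing the weights.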

3.3. Augmented Dictionary Learning ISTA

To deal with situations where the underlying dictionary is not known and, moreover, the sensing matrix changes across samples, one can use an unfolded version of ISTA in which the dictionary is considered a learnable matrix. We term this model Dictionary Learning ISTA (DLISTA); each layer is given by:

$$z_t = \eta_{\theta_t}\left(z_{t-1} + \gamma_t (\Phi\Psi_t)^\top (y - \Phi\Psi_t z_{t-1})\right), \tag{2}$$

with one last linear layer mapping $z$ to the reconstructed input. The model can be trained end-to-end to learn all $\theta_t, \gamma_t, \Psi_t$. The base model is very similar to Behboodi et al. (2022); Aberdam et al. (2021) but, as we will see, it requires additional changes. To see this, consider the basic scenario where the sensing matrix is fixed to $\Phi$ and there is a ground-truth (unknown) dictionary $\Psi_o$ such that $x^* = \Psi_o z^*$, with sparse $z^*$ having support $S$, i.e., $\mathrm{supp}(z^*) = S$. Consider layer $t$ of DLISTA with a fixed sensing matrix $\Phi$, and define:

$$\mu := \max_{1 \le i \ne j \le N} \left| ((\Phi\Psi_t)_i)^\top (\Phi\Psi_t)_j \right| \tag{3}$$

$$\mu_2 := \max_{1 \le i, j \le N} \left| ((\Phi\Psi_t)_i)^\top (\Phi(\Psi_t - \Psi_o))_j \right| \tag{4}$$

$$\delta(\gamma) := \max_i \left| 1 - \gamma \|(\Phi\Psi_t)_i\|_2^2 \right| \tag{5}$$

The term $\mu$ is called the mutual coherence of the matrix $\Phi\Psi_t$. The term $\mu_2$ is closely connected to the generalized mutual coherence; however, unlike the generalized mutual coherence, it includes the diagonal inner product for $i = j$. Finally, the term $\delta(\gamma)$ is reminiscent of the restricted isometry property (RIP) constant Foucart & Rauhut (2013), a key condition for many recovery guarantees in compressed sensing. Note that there is a dependency on $\gamma$. For simplicity, we only keep the dependence on $\gamma$ in the notation and drop the dependence of $\mu$, $\mu_2$ and $\delta$ on $\Phi$ and $\Psi_t$. The following theorem provides conditions under which each layer improves the reconstruction error.

Theorem 3.1. Consider layer $t$ of DLISTA given by equation 2, and suppose that $y = \Phi\Psi_o z^*$ with $\mathrm{supp}(z^*) = S$. We have:

1. Suppose $z_{t-1}$ has the same support as $z^*$, i.e., $\mathrm{supp}(z^*) = \mathrm{supp}(z_{t-1})$. If

$$\gamma_t \left( \mu \|z^* - z_{t-1}\|_1 + \mu_2 \|z^*\|_1 \right) \le \theta_t, \tag{6}$$

then $\mathrm{supp}(z_t) \subseteq \mathrm{supp}(z^*)$.

2. Assuming that the conditions of the last step hold, we get the following bound on the error:

$$\|z_t - z^*\|_1 \le \left( \delta(\gamma_t) + \gamma_t \mu (|S| - 1) \right) \|z_{t-1} - z^*\|_1 + \gamma_t \mu_2 |S| \|z^*\|_1 + |S| \theta_t.$$

We provide the derivations in the supplementary materials. Theorem 3.1 provides insights about the choice of $\gamma_t$ and $\theta_t$, and also suggests that $\delta(\gamma_t) + \gamma_t \mu (|S| - 1)$ needs to be smaller than one to reduce the error at each step. Similar to many existing works in the literature, Theorem 3.1 emphasizes the role of a small mutual coherence, equation 3, for good convergence. Looking at the theorem, it can be seen that $\gamma_t$ and $\theta_t$ play a crucial role in the convergence. However, there is a trade-off underlying these choices. Let us fix $\theta_t$. Decreasing $\gamma_t$ can guarantee good support selection, but it can increase $\delta(\gamma_t)$. When the sensing matrix is fixed, the network can hopefully find good choices through end-to-end training. However, when the sensing matrix $\Phi$ changes across different data samples, i.e., $\Phi \to \Phi^i$, it is no longer guaranteed that a single choice of $\gamma_t$ and $\theta_t$ works for all the different $\Phi^i$. Since these parameters can be determined for a fixed $\Phi$ and $\Psi_t$, we propose using an augmentation network that determines $\gamma_t$ and $\theta_t$ from each pair of $\Phi$ and $\Psi_t$. Following from the theory, we show in Figure 1 the resulting model, named A-DLISTA.

Figure 1: A-DLISTA architecture. The blue blocks represent a single soft-thresholding operation parametrized by the dictionary $\Psi_t$ together with the threshold and step size $\{\theta_t, \gamma_t\}$ at layer $t$. The red blocks represent the augmentation network (with shared parameters across layers) that adapts $\{\theta_t, \gamma_t\}$ for layer $t$ based on the dictionary $\Psi_t$ and the current measurement setup $\Phi^i$ for the $i$-th data sample. The dashed arrows connecting each blue block with a red one mean that, at each iteration, the augmentation network receives the learned dictionary at the current iteration as input (together with the sensing matrix).

A-DLISTA relies on two basic operations at each layer, namely, soft-thresholding (blue blocks in Figure 1) and augmentation (red blocks in Figure 1). The former represents an ISTA-like iteration parametrized by the set of learnable weights $\{\Psi_t, \theta_t, \gamma_t\}$, whilst the latter is implemented using an encoder-decoder type of network. As shown in the figure, the augmentation network takes as input the sensing matrix for the given data sample, $\Phi^i$, together with the dictionary learned at the layer for which the augmentation model generates the $\theta$ and $\gamma$ parameters. Through such an operation, A-DLISTA adapts those last two parameters to the current data sample. We report more details about the augmentation network in Appendix A.
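To make the quantities in equations 3 to 5 and the role of the contraction factor in Theorem 3.1 concrete, the following NumPy sketch computes them for a given triple of matrices and applies one DLISTA layer. It assumes real-valued matrices, absolute values inside the maxima (as in the usual definition of mutual coherence), and that $(\cdot)_i$ denotes the $i$-th column; all function names are hypothetical.

```python
import numpy as np


def soft_threshold(x, theta):
    """Soft-thresholding: sign(x) * max(|x| - theta, 0), applied entrywise."""
    return np.sign(x) * np.maximum(np.abs(x) - theta, 0.0)


def dlista_layer(z_prev, y, Phi, Psi_t, theta_t, gamma_t):
    """One DLISTA layer (equation 2) for a single sample."""
    A = Phi @ Psi_t
    return soft_threshold(z_prev + gamma_t * (A.T @ (y - A @ z_prev)), theta_t)


def coherence_terms(Phi, Psi_t, Psi_o, gamma):
    """Return (mu, mu2, delta(gamma)) following equations 3-5."""
    A = Phi @ Psi_t                      # columns are (Phi Psi_t)_i
    E = Phi @ (Psi_t - Psi_o)            # columns are (Phi (Psi_t - Psi_o))_j
    gram = np.abs(A.T @ A)
    mu = np.max(gram - np.diag(np.diag(gram)))      # off-diagonal entries only
    mu2 = np.max(np.abs(A.T @ E))                   # diagonal i = j included
    delta = np.max(np.abs(1.0 - gamma * np.sum(A ** 2, axis=0)))
    return mu, mu2, delta


def contraction_factor(mu, delta, gamma, support_size):
    """Factor multiplying ||z_{t-1} - z*||_1 in the error bound of Theorem 3.1."""
    return delta + gamma * mu * (support_size - 1)
```

Evaluating `contraction_factor` for several sensing matrices with the same `gamma` gives a quick way to see why a single step size may not keep the factor below one for every $\Phi^i$, which is the motivation for predicting $\gamma_t$ and $\theta_t$ per sample.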

3.4. VLISTA

Although A-DLISTA possesses adaptivity to data samples, it is still based on the assumption that a ground truth dictionary exists. We relax that hypothesis by defining a probability distribution over the sparsifying dictionary and formulate a variational approach, titled VLISTA, to jointly solve the dictionary learning and the sparse recovery problems. To forge our variational framework whilst retaining the helpful adaptivity property of A-DLISTA, we re-interpret the soft-thresholding layers of the latter as part of a likelihood model that defines the output mean for the reconstructed signal. Given its recurrent-like structure Chung et al. (2015), we equip VLISTA with a conditional trainable prior, where the condition is given by the dictionary sampled at the previous iteration. Therefore, the full model comprises three components, namely, the conditional prior $p_\xi(\cdot)$, the variational posterior $q_\phi(\cdot)$, and the likelihood model $p_\Theta(\cdot)$. All components are parametrized by neural networks whose outputs represent the parameters of the underlying probability distribution. In what follows, we describe in more detail the various building blocks of the VLISTA model.

3.4.1. Prior distribution over dictionaries

The conditional prior, $p_\xi(\Psi_t | \Psi_{t-1})$, is modelled as a Gaussian distribution with parameters conditioned on the previously sampled dictionary. We parametrize $p_\xi(\cdot)$ using a neural network, $f_\xi(\cdot) = [f^{\mu}_{\xi_1} \circ g_{\xi_0}(\cdot), f^{\sigma^2}_{\xi_2} \circ g_{\xi_0}(\cdot)]$, with trainable parameters $\xi = \{\xi_0, \xi_1, \xi_2\}$. The model's architecture comprises a shared convolutional block followed by two different branches generating the mean and the variance, respectively, of the Gaussian distribution. Therefore, at layer $t$, the prior conditional distribution is given by:

$$p_\xi(\Psi_t | \Psi_{t-1}) = \prod_{i,j} \mathcal{N}\left(\Psi_{t,i,j} \,\middle|\, \mu_{t,i,j} = f^{\mu}_{\xi_1}(g_{\xi_0}(\Psi_{t-1}))_{i,j};\ \sigma^2_{t,i,j} = f^{\sigma^2}_{\xi_2}(g_{\xi_0}(\Psi_{t-1}))_{i,j}\right),$$

where the indices $i, j$ run over the rows and columns of $\Psi_t$. In order to simplify our expressions, we will abuse notation and refer to distributions like the former as $p_\xi(\Psi_t | \Psi_{t-1}) = \mathcal{N}(\Psi_t | \mu_t = f^{\mu}_{\xi_1}(g_{\xi_0}(\Psi_{t-1}));\ \sigma^2_t = f^{\sigma^2}_{\xi_2}(g_{\xi_0}(\Psi_{t-1})))$. We will use the same type of notation throughout the rest of the manuscript to simplify formulas. The prior's design enforces a dependence of the dictionary at iteration $t$ on the one sampled at the previous iteration, which allows us to refine $\Psi$ as the iterations proceed. The only exception to such a process is the prior imposed over the dictionary at $t = 1$, since there is no previously sampled dictionary in this case. We handle this exception by assuming a standard Gaussian distributed $\Psi_1$. Finally, the joint prior distribution over the dictionaries for VLISTA is given by:

$$p_\xi(\Psi_{1:T}) = \mathcal{N}(\Psi_1 | \mu = 0;\ \sigma^2 = 1) \prod_{t=2}^{T} \mathcal{N}\left(\Psi_t \,\middle|\, \mu_t = f^{\mu}_{\xi_1}(g_{\xi_0}(\Psi_{t-1}));\ \sigma^2_t = f^{\sigma^2}_{\xi_2}(g_{\xi_0}(\Psi_{t-1}))\right)$$
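The factorization of the joint prior corresponds to ancestral sampling: draw $\Psi_1$ from a standard Gaussian, then draw each subsequent dictionary from a Gaussian whose mean and variance are computed from the previous draw. The sketch below illustrates this with NumPy. The two toy functions standing in for the mean and variance branches are purely hypothetical placeholders for the convolutional networks, which are not reproduced here.

```python
import numpy as np


def sample_prior_chain(shape, n_layers, mean_fn, var_fn, rng):
    """Ancestral sampling: Psi_1 ~ N(0, 1), Psi_t ~ N(mean_fn(Psi_{t-1}), var_fn(Psi_{t-1}))."""
    psi = rng.standard_normal(shape)
    chain = [psi]
    for _ in range(n_layers - 1):
        mu, var = mean_fn(psi), var_fn(psi)
        psi = mu + np.sqrt(var) * rng.standard_normal(shape)  # reparameterised draw
        chain.append(psi)
    return chain


# Hypothetical stand-ins for the branches f^mu and f^{sigma^2} (assumptions).
def toy_mean(psi):
    return 0.9 * psi


def toy_var(psi):
    return 0.01 * np.ones_like(psi)


chain = sample_prior_chain((8, 16), n_layers=3, mean_fn=toy_mean, var_fn=toy_var,
                           rng=np.random.default_rng(0))
```

Writing each draw as mean plus scaled noise is the usual reparameterisation that keeps the sample differentiable with respect to the network outputs in an automatic-differentiation framework; whether the original implementation does exactly this is an assumption.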

3.4.2. Posterior distribution over dictionaries

Similarly to the prior model, the variational posterior is also modelled as a Gaussian distribution parametrized by a neural network $f_\phi(\cdot) = [f^{\mu}_{\phi_1} \circ h_{\phi_0}(\cdot), f^{\sigma^2}_{\phi_2} \circ h_{\phi_0}(\cdot)]$, which outputs the mean and variance of the underlying probability distribution:

$$q_\phi(\Psi_t | \hat{x}_{t-1}, y^i, \Phi^i) = \mathcal{N}\left(\Psi_t \,\middle|\, \mu = f^{\mu}_{\phi_1}(h_{\phi_0}(\hat{x}_{t-1}, y^i, \Phi^i));\ \sigma^2 = f^{\sigma^2}_{\phi_2}(h_{\phi_0}(\hat{x}_{t-1}, y^i, \Phi^i))\right).$$

The posterior distribution is conditioned on the data, $\{y^i, \Phi^i\}$, as well as on the reconstructed signal at the previous layer, $\hat{x}_{t-1}$. Therefore, the joint posterior probability over the dictionaries at each layer is given by:

$$q_\phi(\Psi_{1:T} | \hat{x}_{1:T}, y^i, \Phi^i) = \prod_{t=1}^{T} q_\phi(\Psi_t | \hat{x}_{t-1}, y^i, \Phi^i) \tag{8}$$

3.4.3. Likelihood model

At the heart of the reconstruction module is the soft-thresholding block of A-DLISTA. Similarly to the prior and posterior, the likelihood distribution is modelled as a Gaussian parametrized by the output of an A-DLISTA block. Specifically, the likelihood network generates only the mean vector of the Gaussian distribution, since we treat the variance as a tunable hyper-parameter. Therefore, we interpret the reconstructed sparse vector at a given layer as the mean of the likelihood distribution. The joint log-likelihood can then be formulated as:

$$\log p_\Theta(\hat{x}_{1:T} | \Psi_{1:T}, y^i, \Phi^i) = \sum_{t=1}^{T} \log \mathcal{N}\left(\mu_t = \text{A-DLISTA}(\Psi_{1:t}, y^i, \Phi^i; \Theta),\ \sigma^2_t = \delta\right),$$

where $\delta$ is a hyper-parameter of the network. We train all the different components of VLISTA in an end-to-end fashion by maximizing the Evidence Lower Bound (ELBO). The full objective function is given by:

$$\begin{aligned}
\text{ELBO} ={}& \sum_{t=1}^{T} \mathbb{E}_{\Psi_{1:t} \sim q_\phi(\Psi_{1:t} | y^i, \Phi^i, \hat{x}_{0:t-1})} \left[ \log p_\Theta(x_t = x^i_{gt} | \Psi_{1:t}, y^i, \Phi^i) \right] \\
&- \sum_{t=2}^{T} \mathbb{E}_{\Psi_{1:t-1} \sim q_\phi(\Psi_{1:t-1} | y^i, \Phi^i, \hat{x}_{t-1})} \left[ D_{KL}\left( q_\phi(\Psi_t | y^i, \Phi^i, \hat{x}_{t-1}) \,\|\, p_\xi(\Psi_t | \Psi_{t-1}) \right) \right] \\
&- D_{KL}\left( q_\phi(\Psi_1 | \hat{x}_0) \,\|\, p_\xi(\Psi_1) \right)
\end{aligned} \tag{10}$$

The first term in Equation 10 represents the likelihood contribution, whilst the second and third terms account for the KL divergence. We report more details about the models' architecture and the objective function in Appendix A and Appendix B, respectively.
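Since the prior, the posterior, and the likelihood are all Gaussian, every term of Equation 10 has a closed form once one dictionary has been sampled per layer. The following NumPy sketch shows a single-sample Monte Carlo estimate under the assumption of diagonal covariances. The function names, the argument layout, and the optional `kl_weight` factor are illustrative assumptions; the networks that produce the means and variances are not shown.

```python
import numpy as np


def gaussian_log_lik(x, mean, var):
    """Log-density of a diagonal Gaussian N(x | mean, var), summed over entries."""
    return -0.5 * np.sum(np.log(2.0 * np.pi * var) + (x - mean) ** 2 / var)


def gaussian_kl(mu_q, var_q, mu_p, var_p):
    """KL(N(mu_q, var_q) || N(mu_p, var_p)) for diagonal Gaussians, summed over entries."""
    return 0.5 * np.sum(np.log(var_p / var_q) + (var_q + (mu_q - mu_p) ** 2) / var_p - 1.0)


def elbo_estimate(x_gt, layer_means, delta, q_params, p_params, kl_weight=1.0):
    """Single-sample estimate of the ELBO.

    layer_means[t]: likelihood mean at layer t, computed from sampled dictionaries.
    delta:          likelihood variance (a hyper-parameter).
    q_params[t], p_params[t]: (mean, variance) of posterior and prior at layer t;
    the prior at the first layer is the standard Gaussian.
    """
    log_lik = sum(gaussian_log_lik(x_gt, m, delta) for m in layer_means)
    kl = sum(gaussian_kl(mq, vq, mp, vp)
             for (mq, vq), (mp, vp) in zip(q_params, p_params))
    return log_lik - kl_weight * kl
```

With `kl_weight=1.0` the function matches the structure of Equation 10; a smaller weight only rescales the KL terms.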

4. Experimental Results

To assess the performance of the proposed approach, we employ three datasets, namely, MNIST, CIFAR10, and a synthetic one. We compare our models' performance against ISTA, LISTA Gregor & LeCun (2010), and BCS Ji et al. (2008). However, we do not consider other LISTA variations such as ALISTA Liu et al. (2019) or NALISTA Behrens et al. (2021), since a varying measurement setup across the dataset requires solving an inner optimization problem to evaluate the W matrix for each data sample. As a result, training such models is extremely slow. Moreover, to show the benefit of adaptivity, we conduct an ablation study on A-DLISTA by removing its augmentation network and making the parameters $\theta_t, \gamma_t$ learnable through backpropagation. We refer to the non-augmented version of A-DLISTA as DLISTA (see subsection 3.3 for more details). Hence, for DLISTA, $\theta_t$ and $\gamma_t$ can no longer be adapted to the specific input sensing matrix. For all models that we train, we consider three layers. However, since ISTA is a classical method with no learned parameters, we also consider the results obtained using 1000 iterations. To consider a scenario with varying sensing matrices, we adopt the following procedure. For each data sample in the training and test sets, $x^i$, we generate a sensing matrix, $\Phi^i$, by randomly sampling its entries from a standard Gaussian distribution. Subsequently, for each pair of sensing matrix and ground truth signal, we generate the corresponding observations as $y^i = \Phi^i x^i$. We report more details about the training of the models in Appendix B.
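The measurement procedure described above, one sensing matrix per sample followed by a matrix-vector product, can be sketched in a few lines of NumPy. The helper name, the random placeholder data, and the number of measurements are illustrative assumptions.

```python
import numpy as np


def make_measurements(X, n_measurements, rng):
    """Draw one standard Gaussian sensing matrix per sample and compute y_i = Phi_i x_i."""
    n_samples, dim = X.shape
    Phi = rng.standard_normal((n_samples, n_measurements, dim))
    Y = np.einsum("nmd,nd->nm", Phi, X)   # batched matrix-vector product
    return Phi, Y


rng = np.random.default_rng(0)
X = rng.standard_normal((5, 784))         # placeholder for flattened images
Phi, Y = make_measurements(X, n_measurements=100, rng=rng)
```

Storing all matrices at once costs `n_samples * n_measurements * dim` floats, so for a full dataset it may be preferable to generate them per batch.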

4.1. MNIST & CIFAR10

The first task on which we test our models is image reconstruction, considering the MNIST and CIFAR10 datasets. We report the results in terms of the Structural Similarity Index Measure (SSIM) for the following setup. We fix the number of layers, or iterations, for all models to three, and then we measure the SSIM while varying the number of measurements. To compute the observation vector ($y^i$), we generate a different sensing matrix for each image by sampling its entries from a standard Gaussian distribution (more details about the data generation can be found in Appendix B). The results are reported in Table 1 and Table 2 for MNIST and CIFAR10, respectively. By looking at Table 1 and Table 2, we can draw the following conclusions. Concerning the non-Bayesian models, our A-DLISTA model outperforms all the others. Moreover, by comparing the performance of A-DLISTA with its non-augmented version, i.e., DLISTA, we see the benefits of using an augmentation network to make the model adaptive. Concerning the Bayesian approaches, our VLISTA model outperforms BCS. In particular, our models outperform the others when the number of measurements is low. Concerning the lower performance of VLISTA compared to A-DLISTA, we can mention a few reasons that could explain such behaviour. One contribution to the difference might come from the noise that is naturally injected at training time by the random sampling procedure used to generate the dictionaries. Another contribution is the amortization gap, which affects all models based on amortized variational inference. However, although VLISTA shows lower performance than A-DLISTA, it is important to notice that it still performs better than BCS. Moreover, it can detect OOD samples, a characteristic that only Bayesian models possess.

4.2. Synthetic Dataset

To generate the synthetic dataset, we follow a protocol similar to that in Liu & Chen (2019). First, we generate a sensing matrix and a sparsifying dictionary for each data sample by sampling their entries from a standard Gaussian distribution. Then, the components of the ground truth sparse signal, $z^*$, are sampled from a standard Gaussian distribution as well. Finally, some of the components of $z^*$ are set to zero as dictated by a Bernoulli distribution with $p = 0.1$. The overall dataset comprises 1000 samples, shared across the train and test sets. To compare the performance of different models, we first draw the c.d.f. of the Normalized Mean Square Error (NMSE) on the test set and then compute the 40% quantile. Results are reported in Table 3. Similarly to the setup used in the previous section, for the synthetic dataset we fix the number of layers, or iterations, for each model to three, and then we vary the number of measurements.

Table 3: results on the synthetic dataset (only one unlabeled row survives: -3.67, -8.20, -13.94, -15.31, -14.02).

By looking at Table 3, we can draw a conclusion similar to that for the MNIST and CIFAR10 datasets.
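The synthetic protocol and the evaluation metric can be sketched as follows. Two points are assumptions of this sketch rather than statements from the text: the Bernoulli parameter is read as the probability that an entry is kept (so about 10% of the entries are non-zero), and the NMSE is computed as the ratio $\|\hat{x} - x\|_2^2 / \|x\|_2^2$ per sample before taking the quantile. The function names are hypothetical.

```python
import numpy as np


def sample_sparse_codes(n_samples, n_atoms, p_nonzero, rng):
    """Gaussian codes whose entries are kept with probability p_nonzero, zeroed otherwise."""
    z = rng.standard_normal((n_samples, n_atoms))
    mask = rng.random((n_samples, n_atoms)) < p_nonzero   # Bernoulli(p_nonzero) mask
    return z * mask


def nmse(x_hat, x_true):
    """Per-sample normalised mean square error ||x_hat - x||^2 / ||x||^2."""
    err = np.sum((x_hat - x_true) ** 2, axis=-1)
    return err / np.sum(x_true ** 2, axis=-1)


def nmse_quantile(x_hat, x_true, q=0.4):
    """q-quantile of the empirical distribution of per-sample NMSE values."""
    return np.quantile(nmse(x_hat, x_true), q)
```

If the results are to be expressed in decibels, `10 * np.log10(...)` can be applied to the quantile; whether the reported numbers use that convention is not stated in the text.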

4.3. Out Of Distribution detection

In this section, we focus on one of the most important differences between VLISTA and non-Bayesian models for solving linear inverse problems. Indeed, differently from any non-Bayesian approach to compressed sensing, VLISTA allows for quantifying uncertainties on the reconstructed signals which, in turn, enables OOD detection without the need to access ground truth data at inference time. Moreover, whilst other Bayesian approaches Ji et al. (2008); Zhou et al. (2014) usually focus on designing specific priors to satisfy the sparsity constraint on the reconstructed signal after marginalization, VLISTA completely overcomes this issue, as the thresholding operation is not affected by the marginalization over dictionaries. To show that VLISTA can detect OOD samples, we employ the MNIST dataset. First, we split the full dataset into two subsets named "Train", or In-Distribution (ID), and OOD. The ID subset contains images from three digits only, namely, 0, 3, and 7 (randomly chosen). The OOD subset contains images from all the other digits. Then, we split the ID partition into training and test sets and train VLISTA on the former. Once trained, we evaluate the model performance by considering reconstructions from the test set (ID) and the OOD partition. We reconstruct every single image 100 times, sampling a new dictionary each time. As a summary statistic, we then compute the c.d.f. of the variance of the per-pixel standard deviation ($\mathrm{var}\,\sigma_{pp}$) across reconstructions. To assess whether a given digit belongs to the ID or OOD distribution, we compute the p-value for $\mathrm{var}\,\sigma_{pp}$ by employing the two-sample t-test. Moreover, to assess whether the OOD detection is robust to measurement noise, we repeat the same test for different levels of noise. As a baseline for the current task, we consider BCS. Due to the different nature of the BCS framework, we employ a slightly different procedure to evaluate the p-values for it. 
Specifically, we use the same ID and OOD splits as for VLISTA. However, for BCS, we consider the c.d.f. of the reconstruction error that is evaluated by the model itself. The rest of the procedure is the same as for VLISTA. We report the results for OOD detection in Figure 2. As we can see from the figure, VLISTA outperforms BCS for each level of noise, showing a lower p-value than BCS, which corresponds to a higher rejection power. As expected, by increasing the level of noise we observe a larger p-value, meaning that OOD rejection becomes harder for noisier data. However, we can see that whilst VLISTA is still capable of detecting OOD samples, BCS fails to do so when the Signal-to-Noise Ratio (SNR), expressed in decibels, is greater than 10. As a reference point to define whether the model is correctly rejecting OOD samples or not, we report in Figure 2 the 5% line for the p-value. Such a value is typically used as a reference in hypothesis testing to decide whether or not to reject the null hypothesis.
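The OOD test described in this section can be sketched with NumPy and the standard library. Several choices below are assumptions made only for illustration: the summary statistic is taken to be the variance over pixels of the per-pixel standard deviation across the repeated reconstructions, the t-test is the Welch (unequal variance) variant, and the two-sided p-value uses a normal approximation to the t distribution, which is reasonable only for large sample counts. The function names are hypothetical.

```python
import math

import numpy as np


def var_sigma_pp(recons):
    """Summary statistic per image.

    recons: array of shape (n_images, n_draws, n_pixels), one reconstruction per
    sampled dictionary. Returns an array of shape (n_images,).
    """
    sigma_pp = recons.std(axis=1)     # per-pixel std across draws
    return sigma_pp.var(axis=1)       # variance of that std over pixels


def welch_t_test(a, b):
    """Welch two-sample t statistic with a normal-approximation two-sided p-value."""
    se = math.sqrt(a.var(ddof=1) / a.size + b.var(ddof=1) / b.size)
    t = (a.mean() - b.mean()) / se
    p = math.erfc(abs(t) / math.sqrt(2.0))
    return t, p


def is_ood(stat_id, stat_candidate, alpha=0.05):
    """Reject the null hypothesis (same distribution) when the p-value is below alpha."""
    _, p = welch_t_test(stat_id, stat_candidate)
    return p < alpha
```

A typical use would pass the statistics of the ID test set as `stat_id` and those of the images under scrutiny as `stat_candidate`, with `alpha=0.05` matching the 5% reference line mentioned above.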

5. Conclusion

We present a variational approach, dubbed VLISTA, to jointly solve the dictionary learning and the sparse recovery problems. Typically, compressed sensing frameworks assume the existence of a ground truth dictionary used to reconstruct the signal. Furthermore, state-of-the-art LISTA-like models usually consider a stationary measurement setup. In our work, we relax both assumptions. First, we show that it is possible to design a soft-thresholding algorithm, termed A-DLISTA, that can handle different sensing matrices and that can adapt its parameters to the given data instance. We theoretically justify the use of an augmentation network which adapts the threshold and step size for each layer based on the current input and the learned dictionary. Finally, we also relax the hypothesis concerning the existence of a ground truth dictionary by introducing a probability distribution over it. Given such an assumption, we formulate the VLISTA variational framework to solve the compressed sensing task. We report results for both our models, A-DLISTA and VLISTA, comparing them with non-Bayesian and Bayesian approaches, respectively, to jointly solving the sparse recovery and dictionary learning problems. We empirically show that the adaptation capability of A-DLISTA results in a boost in performance compared to the ISTA and LISTA models in a non-static measurement scenario. Although VLISTA does not outperform A-DLISTA in terms of reconstruction, the variational framework enables us to evaluate uncertainties over the reconstructed signals, which are useful for detecting OOD samples. In contrast, none of the LISTA-like models allows for such a task. Moreover, differently from other Bayesian approaches to compressed sensing, VLISTA does not need specifically designed priors to retain sparsity after marginalization of the reconstructed sparse signal; the averaging operation concerns the sparsifying dictionary instead of the sparse signal itself.

A Architecture Details

In this section, we report the details of the architecture of our proposed models. Concerning A-DLISTA, as we show in Figure 1, the reconstruction network, i.e., the blue blocks, is an unfolded ISTA-like model with a parametrized dictionary $\Psi_t$. Each layer is characterized by its own dictionary, which is used both to reconstruct the sparse vector and as an input for the augmentation network. As mentioned in subsection 3.3, the augmentation model, the red block in Figure 1, takes as input the measurement matrix, $\Phi^i$, and the dictionary at a given reconstruction layer $t$, $\Psi_t$, and generates the adaptive parameters $\{\gamma_t, \theta_t\}$ for the $t$-th layer. We show the architecture of the augmentation network in Figure 3. The prior (subsubsection 3.4.1) and posterior (subsubsection 3.4.2) models are instead implemented using an encoder-decoder scheme based on convolutional layers. We report in Figure 4 the architecture of the prior and posterior models. Finally, we report in Figure 5 the graphical models for the posterior and the conditional prior in the left and right plots, respectively.

B Implementation and Training Details

We report in this section a few details about the implementation and training of the A-DLISTA and VLISTA models. We implemented both using the Lightning framework. As we mentioned in the main body of the manuscript, ISTA and LISTA require a known dictionary in order to reconstruct the non-sparse signal. Concerning the three datasets that we consider, we define the dictionaries in the following way: canonical for MNIST (since MNIST is already sparse) with 784 atoms, and wavelet for CIFAR10 with 1024 atoms. Regarding the synthetic dataset, we randomly generated the dictionary from a standard distribution and we consider 765 atoms. Concerning A-DLISTA, we trained both the reconstruction and the augmentation networks (blue and red blocks in Figure 1, respectively) end-to-end using the Adam optimizer. We set the initial learning rate to $10^{-2}$ and $10^{-3}$ for the reconstruction and augmentation networks, respectively, and we reduced it by a factor of 10 every time the loss failed to improve for more than 30 training steps. Moreover, we set the weight decay to $5 \times 10^{-4}$ and the batch size to 128. We applied the same scheme across all the datasets we used. As the objective function, we used the MSE between the ground truth signal and the reconstructed one for the image datasets, and the NMSE for the synthetic dataset. Concerning the VLISTA training, as for A-DLISTA, we train the full model end-to-end. However, compared to A-DLISTA, the hyperparameter space has a much higher dimension in this case. Therefore, we employed the Tune library for hyperparameter search. Specifically, we used HyperOptSearch and ASHAScheduler as searcher and scheduler, respectively.
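The learning-rate rule described above (divide by 10 when the loss has not improved for more than 30 steps) can be sketched in a few lines of plain Python. The class name `PlateauDecay` and its interface are hypothetical and only illustrate the rule; the text does not state which scheduler implementation was used.

```python
class PlateauDecay:
    """Hypothetical helper: reduce the learning rate when the loss plateaus."""

    def __init__(self, lr, factor=0.1, patience=30):
        self.lr = lr
        self.factor = factor        # multiplicative drop (0.1 = divide by 10)
        self.patience = patience    # tolerated number of non-improving steps
        self.best = float("inf")
        self.bad_steps = 0

    def step(self, loss):
        """Record one training-step loss and return the current learning rate."""
        if loss < self.best:
            self.best = loss
            self.bad_steps = 0
        else:
            self.bad_steps += 1
            if self.bad_steps > self.patience:
                self.lr *= self.factor
                self.bad_steps = 0
        return self.lr


# Assumed usage with the initial rates quoted in the text.
reconstruction_schedule = PlateauDecay(lr=1e-2)
augmentation_schedule = PlateauDecay(lr=1e-3)
```

In a PyTorch-based setup, `torch.optim.lr_scheduler.ReduceLROnPlateau` with `factor=0.1` and a matching `patience` plays a comparable role.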
We set the learning rates to $7 \times 10^{-3}$, $5 \times 10^{-3}$, and $10^{-4}$ for the likelihood, posterior, and prior models, respectively. As for A-DLISTA, we use a scheduler to reduce the learning rate. Concerning the objective function, we maximize the ELBO, as is typically done for this type of model. Moreover, we set the weight of the KL divergence to $10^{-3}$. We report the details of the objective in Equation 11.

\begin{align*}
\log p(x_{1:T} = x^i_{\mathrm{gt}} \mid y^i, \Phi^i)
&= \log \int p(x_{1:T} = x^i_{\mathrm{gt}} \mid \Psi_{1:T}, y^i, \Phi^i)\, p(\Psi_{1:T})\, d\Psi_{1:T} \tag{11} \\
&= \log \int \frac{p(x_{1:T} = x^i_{\mathrm{gt}} \mid \Psi_{1:T}, y^i, \Phi^i)\, p(\Psi_{1:T})\, q(\Psi_{1:T} \mid y^i, \Phi^i, \hat{x}_{1:T})}{q(\Psi_{1:T} \mid y^i, \Phi^i, \hat{x}_{1:T})}\, d\Psi_{1:T} \\
&\geq \int q(\Psi_{1:T} \mid y^i, \Phi^i, \hat{x}_{1:T}) \log \frac{p(x_{1:T} = x^i_{\mathrm{gt}} \mid \Psi_{1:T}, y^i, \Phi^i)\, p(\Psi_{1:T})}{q(\Psi_{1:T} \mid y^i, \Phi^i, \hat{x}_{1:T})}\, d\Psi_{1:T} \\
&= \int q(\Psi_{1:T} \mid y^i, \Phi^i, \hat{x}_{1:T}) \log p(x_{1:T} = x^i_{\mathrm{gt}} \mid \Psi_{1:T}, y^i, \Phi^i)\, d\Psi_{1:T} \\
&\quad + \int q(\Psi_{1:T} \mid y^i, \Phi^i, \hat{x}_{1:T}) \log \frac{p(\Psi_{1:T})}{q(\Psi_{1:T} \mid y^i, \Phi^i, \hat{x}_{1:T})}\, d\Psi_{1:T} \\
&= \sum_{t=1}^{T} \mathbb{E}_{\Psi_{1:t} \sim q(\Psi_{1:t} \mid y^i, \Phi^i, \hat{x}_{0:t-1})} \left[ \log p(x_t = x^i_{\mathrm{gt}} \mid \Psi_{1:t}, y^i, \Phi^i) \right] \\
&\quad - \sum_{t=2}^{T} \mathbb{E}_{\Psi_{1:t-1} \sim q(\Psi_{1:t-1} \mid y^i, \Phi^i, \hat{x}_{t-1})} \left[ D_{KL}\!\left( q(\Psi_t \mid y^i, \Phi^i, \hat{x}_{t-1}) \,\|\, p(\Psi_t \mid \Psi_{t-1}) \right) \right] \\
&\quad - D_{KL}\!\left( q(\Psi_1 \mid \hat{x}_0) \,\|\, p(\Psi_1) \right)
\end{align*}

Finally, to give an overview of the training scheme for VLISTA, Figure 6 shows a diagram of the full training pipeline.

C Derivation for Theorem 3.1

Convergence proofs of ISTA-type models generally involve two steps. First, one investigates how the support is found and locked in; second, how the error shrinks at each step. We focus on these two steps, which matter most for our architecture design. Our analysis is similar in nature to Chen et al. (2018); Aberdam et al. (2021); however, it differs from Aberdam et al. (2021) in considering unknown dictionaries and from Chen et al. (2018) in both the considered architecture and the varying sensing matrix. In what follows, we consider the noiseless setting. However, the results can be extended to noisy setups by adding terms containing the noise norm, similarly to Chen et al. (2018). We make the following assumptions:

1. There is a ground-truth (unknown) dictionary $\Psi_o$ such that $x_* = \Psi_o z_*$.
2. As a consequence, $y = \Phi \Psi_o z_*$.
3. We assume that $z_*$ is sparse with its support contained in $S$. In other words, $z_{*,i} = 0$ for $i \in S^c$.

As a first step, we fix the sensing matrix $\Phi$ and conduct the analysis. First define the following:

\begin{align*}
\mu &:= \max_{1 \leq i \neq j \leq N} \left| ((\Phi\Psi_t)_i)^\top (\Phi\Psi_t)_j \right| \\
\mu_2 &:= \max_{1 \leq i, j \leq N} \left| ((\Phi\Psi_t)_i)^\top (\Phi(\Psi_t - \Psi_o))_j \right| \tag{13} \\
\delta(\gamma) &:= \max_i \left| 1 - \gamma \|(\Phi\Psi_t)_i\|_2^2 \right| \tag{14}
\end{align*}

The main step of the soft-thresholding algorithm is given as follows:

$$z_t = \eta_{\theta_t}\!\left( z_{t-1} + \gamma_t (\Phi\Psi_t)^\top (y - \Phi\Psi_t z_{t-1}) \right),$$

with the entry-wise relation given by

$$z_{t,i} = \eta_{\theta_t}\!\left( z_{t-1,i} + \gamma_t ((\Phi\Psi_t)_i)^\top (y - \Phi\Psi_t z_{t-1}) \right).$$

Using the assumptions, we have:

$$(\Phi\Psi_t)^\top (y - \Phi\Psi_t z_{t-1}) = (\Phi\Psi_t)^\top (\Phi\Psi_o z_* - \Phi\Psi_t z_{t-1}) = (\Phi\Psi_t)^\top \Phi (\Psi_o z_* - \Psi_t z_{t-1}).$$

C.1 Locking the support

First, we show under what conditions the algorithm locks on the support. Suppose that the support of $z_{t-1}$ is already the same as that of $z_*$, namely $\mathrm{supp}(z_{t-1}) = \mathrm{supp}(z_*) = S$. Consider $i \in S^c$. We have

$$z_{t,i} = \eta_{\theta_t}\!\left( \gamma_t ((\Phi\Psi_t)_i)^\top (y - \Phi\Psi_t z_{t-1}) \right).$$

To lock the support, we need to guarantee that

$$\left| \gamma_t ((\Phi\Psi_t)_i)^\top (y - \Phi\Psi_t z_{t-1}) \right| \leq \theta_t.$$

We have:

\begin{align*}
((\Phi\Psi_t)_i)^\top \Phi (\Psi_o z_* - \Psi_t z_{t-1})
&= ((\Phi\Psi_t)_i)^\top \Phi (\Psi_t z_* - \Psi_t z_{t-1}) + ((\Phi\Psi_t)_i)^\top \Phi (\Psi_o z_* - \Psi_t z_*) \\
&= \sum_{j \in S} ((\Phi\Psi_t)_i)^\top (\Phi\Psi_t)_j \, (z_{*,j} - z_{t-1,j}) \tag{21} \\
&\quad + ((\Phi\Psi_t)_i)^\top \Phi (\Psi_o z_* - \Psi_t z_*)
\end{align*}

We can bound the first term by:

\begin{align*}
\left| \sum_{j \in S} ((\Phi\Psi_t)_i)^\top (\Phi\Psi_t)_j \, (z_{*,j} - z_{t-1,j}) \right|
&\leq \sum_{j \in S} \left| ((\Phi\Psi_t)_i)^\top (\Phi\Psi_t)_j \right| \left| z_{*,j} - z_{t-1,j} \right| \tag{23} \\
&\leq \mu \, \| z_* - z_{t-1} \|_1,
\end{align*}

where we use the definition of mutual coherence for the upper bound. The other term can be written as

$$((\Phi\Psi_t)_i)^\top \Phi (\Psi_o z_* - \Psi_t z_*) = \sum_{j \in S} ((\Phi\Psi_t)_i)^\top (\Phi(\Psi_o - \Psi_t))_j \, z_{*,j}.$$

Note that this analysis shows that we strive for a smaller $\mu_2$, namely for being close to the ground-truth dictionary, a small enough $\theta_t$ as the iterations grow, and small $\mu$ and $\delta(\gamma_t)$.
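The coherence quantities used in this appendix can be evaluated numerically for a given triple of matrices. The following NumPy sketch assumes that the maxima are taken over absolute values of column inner products and that `psi_o` is available, which is only the case in synthetic experiments; the function name is hypothetical.

```python
import numpy as np


def coherence_quantities(phi, psi_t, psi_o, gamma):
    """Return (mu, mu_2, delta(gamma)) for sensing matrix phi,
    learned dictionary psi_t and reference dictionary psi_o."""
    a = phi @ psi_t                    # columns (Phi Psi_t)_i
    d = phi @ (psi_t - psi_o)          # columns (Phi (Psi_t - Psi_o))_j

    gram = np.abs(a.T @ a)
    off_diag = gram - np.diag(np.diag(gram))   # drop the i == j terms
    mu = off_diag.max()                        # mutual coherence of Phi Psi_t

    mu_2 = np.abs(a.T @ d).max()               # coherence w.r.t. dictionary error

    col_norms_sq = np.sum(a ** 2, axis=0)
    delta = np.abs(1.0 - gamma * col_norms_sq).max()
    return mu, mu_2, delta
```

When `psi_t` equals `psi_o` the second quantity vanishes, which matches the observation that a dictionary close to the ground truth is desirable.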



Preprint. Under review.



Figure 2: p-value for OOD rejection as a function of the noise level. The green line represents a reference p-value equal to 0.05.

Figure 3: Augmentation network for A-DLISTA. Each red block in Figure 1 corresponds to the model shown.

Concerning VLISTA, as introduced in subsection 3.4, it comprises three different models: prior, posterior, and likelihood. The likelihood model is assumed to be a Gaussian distribution whose mean is parametrized by the A-DLISTA model (subsubsection 3.4.3).

Figure 4: Left: prior network architecture. Right: posterior network architecture. For the posterior model, the figure shows the output shape of each of the three heads. Such a structure is necessary since the posterior model accepts three inputs, namely the observations, the sensing matrix, and the reconstruction from the previous layer, which are characterized by different shapes. The term "B" indicates the batch size.

Figure 5: Graphical model of the Variational LISTA model; dependencies on $y^i$, $\Phi^i$ are factored out for simplicity. The sampling is done only based on the posterior $q_\phi(\Psi_t \mid \hat{x}_{t-1}, y^i, \Phi^i)$. The dashed line shows variational approximations.

Figure 6: Variational LISTA overall architecture and training pipeline. Note that, to simplify the figure, we did not report the index for the data sample (i) but only for the iteration (t).

. (2020) provides various guidelines for architectural changes to improve LISTA, for example in convergence, parameter efficiency, step size and threshold adaptation, and overshooting. The common assumptions of these works are a fixed and known sparsifying dictionary and a fixed sensing matrix. Steps toward relaxing these assumptions were taken in Aberdam et al. (2021); Behboodi et al. (2022); Schnoor et al. (2022). In Aberdam et al. (2021), the authors propose a model to deal with a varying sensing matrix (dictionary). The authors in Schnoor et al. (2022); Behboodi et al.

MNIST SSIM (the higher the better) for different numbers of measurements. The top three rows concern non-Bayesian models, whilst the bottom two report results for Bayesian approaches. For each of the two sets of results, we highlight the best performance in bold.

CIFAR10 SSIM (the higher the better) for different numbers of measurements. The top three rows concern non-Bayesian models, whilst the bottom two report results for Bayesian approaches. For each of the two sets of results, we highlight the best performance in bold.

NMSE quantile (the lower the better) for different numbers of measurements. The top three rows concern non-Bayesian models, whilst the bottom two report results for Bayesian approaches. For each of the two sets of results, we highlight the best performance in bold.

Chen et al. (2018); Aberdam et al. (2021); however, it differs from Aberdam et al. (2021) in considering unknown dictionaries and from Chen et al. (2018) in both the considered architecture and the varying sensing matrix.

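\begin{align*}
\left| ((\Phi\Psi_t)_i)^\top \Phi (\Psi_o z_* - \Psi_t z_*) \right|
&\leq \sum_{j \in S} \left| ((\Phi\Psi_t)_i)^\top (\Phi(\Psi_o - \Psi_t))_j \right| \left| z_{*,j} \right| \tag{26} \\
&\leq \mu_2 \, \| z_* \|_1. \tag{27}
\end{align*}

Therefore, we obtain the following sufficient condition for locking the support:

$$\gamma_t \left( \mu \, \| z_* - z_{t-1} \|_1 + \mu_2 \, \| z_* \|_1 \right) \leq \theta_t \tag{28}$$

If this condition is satisfied, the support is locked.

C.2 Controlling the errors

For $i \in S$, we have:

$$\left| z_{t,i} - z_{*,i} \right| \leq \left| z_{t-1,i} + \gamma_t ((\Phi\Psi_t)_i)^\top (y - \Phi\Psi_t z_{t-1}) - z_{*,i} \right| + \theta_t.$$

Summing over $i \in S$, we obtain:

$$\sum_{i \in S} \left| z_{t,i} - z_{*,i} \right| \leq \left( \delta(\gamma_t) + \gamma_t \mu (|S| - 1) \right) \| z_{S,t-1} - z_* \|_1 + \gamma_t \mu_2 |S| \, \| z_* \|_1 + |S| \theta_t.$$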
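The step from the entry-wise bound to the final $\ell_1$ bound in C.2 can be filled in as follows. This is a sketch of one possible route, assuming $\mathrm{supp}(z_{t-1}) = S$, $y = \Phi\Psi_o z_*$, and that $\mu$, $\mu_2$ and $\delta(\gamma)$ are defined with absolute values; it is not necessarily the argument used by the authors.

```latex
% For i in S, split off the j = i term of the Gram sum:
\begin{align*}
z_{t-1,i} + \gamma_t ((\Phi\Psi_t)_i)^\top (y - \Phi\Psi_t z_{t-1}) - z_{*,i}
&= \left(1 - \gamma_t \|(\Phi\Psi_t)_i\|_2^2\right)(z_{t-1,i} - z_{*,i}) \\
&\quad + \gamma_t \sum_{j \in S,\, j \neq i} ((\Phi\Psi_t)_i)^\top (\Phi\Psi_t)_j \,(z_{*,j} - z_{t-1,j}) \\
&\quad + \gamma_t \sum_{j \in S} ((\Phi\Psi_t)_i)^\top (\Phi(\Psi_o - \Psi_t))_j \, z_{*,j}.
\end{align*}
% Take absolute values and apply the definitions of delta, mu and mu_2:
\begin{align*}
|z_{t,i} - z_{*,i}|
&\leq \delta(\gamma_t)\,|z_{t-1,i} - z_{*,i}|
 + \gamma_t \mu \sum_{j \in S,\, j \neq i} |z_{*,j} - z_{t-1,j}|
 + \gamma_t \mu_2 \|z_*\|_1 + \theta_t.
\end{align*}
% Sum over i in S; each index j is counted |S| - 1 times in the double sum:
\begin{align*}
\sum_{i \in S} |z_{t,i} - z_{*,i}|
&\leq \left(\delta(\gamma_t) + \gamma_t \mu (|S| - 1)\right) \|z_{S,t-1} - z_*\|_1
 + \gamma_t \mu_2 |S| \,\|z_*\|_1 + |S|\,\theta_t.
\end{align*}
```

The first inequality also uses the fact that soft-thresholding moves each entry by at most $\theta_t$, which is where the additive $\theta_t$ term comes from.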

[Figure 4 layer labels: Conv2D (in_c=1, out_c=32, k=3, s=2), Tanh(), ConvT2D (in_c=32, out_c=1, k=4, s=2); the remaining labels are truncated.]

