LEARNED ISTA WITH ERROR-BASED THRESHOLDING FOR ADAPTIVE SPARSE CODING Anonymous

Abstract

The learned iterative shrinkage thresholding algorithm (LISTA) introduces deep unfolding models with learnable thresholds in the shrinkage function for sparse coding. Drawing on some theoretical insights, we advocate an error-based thresholding (EBT) mechanism for LISTA, which leverages a function of the layer-wise reconstruction error to suggest an appropriate threshold value for each observation on each layer. We show that the EBT mechanism well-disentangles the learnable parameters in the shrinkage functions from the reconstruction errors, making them more adaptive to the various observations. With rigorous theoretical analyses, we show that the proposed EBT can lead to faster convergence on the basis of LISTA and its variants, in addition to its higher adaptivity. Extensive experimental results confirm our theoretical analyses and verify the effectiveness of our methods.

1. INTRODUCTION

Sparse coding is widely used in many machine learning applications (Xu et al., 2012; Dabov et al., 2007; Yang et al., 2010; Ikehata et al., 2012), and its core problem is to infer a high-dimensional sparse code from a low-dimensional observation, e.g., under the assumption y = Ax_s + ε, where y ∈ R^m is the observation corrupted by inevitable noise ε ∈ R^m, x_s ∈ R^n is the sparse code to be estimated, and A ∈ R^{m×n} is an over-complete dictionary matrix. Recovering x_s purely from y is called the sparse linear inverse problem (SLIP). The main challenge in solving SLIP is its ill-posed nature due to the over-complete modeling, i.e., m < n. A possible solution to SLIP can be obtained by solving a LASSO problem with l_1 regularization:

min_x (1/2)‖y − Ax‖_2² + λ‖x‖_1. (1)

Possible solvers for Eq. (1) include the iterative shrinkage thresholding algorithm (ISTA) (Daubechies et al., 2004) and its variants, e.g., fast ISTA (FISTA) (Beck & Teboulle, 2009). Despite their simplicity, these traditional optimization algorithms suffer from slow convergence on large-scale problems. Therefore, Gregor & LeCun (2010) proposed the learned ISTA (LISTA), a deep neural network (DNN) whose architecture follows the iterative process of ISTA. The thresholding mechanism was turned into shrinkage functions with learnable thresholds in the DNN. LISTA achieved superior performance in sparse coding, and many theoretical analyses have been proposed to modify LISTA and further improve its performance (Chen et al., 2018; Liu et al., 2019; Zhou et al., 2018; Ablin et al., 2019; Wu et al., 2020). Yet, LISTA and many deep networks based on it suffer from two issues. (a) Though the thresholds of the shrinkage functions in LISTA are learnable, their values are shared among all training samples and thus lack adaptivity to the variety of training samples and robustness to outliers.
According to prior work (Chen et al., 2018; Liu et al., 2019), the thresholds should be proportional to the upper bound of the norm of the current estimation error to guarantee fast convergence in LISTA. However, outliers with drastically higher estimation errors affect the thresholds more, making the learned thresholds less suitable for the other (training) samples. (b) For the same reason, it may also lead to poor generalization to test data whose distribution (or sparsity (Chen et al., 2018)) differs from that of the training data. For instance, in practice we may only be given some synthetic sparse codes but not the real ones for training, and current LISTA models may fail to generalize under such circumstances. In this paper, we propose an error-based thresholding (EBT) mechanism to address the aforementioned issues of LISTA-based models and improve their performance. Drawing on theoretical insights, EBT introduces a function of the evolving estimation error to provide each threshold in the shrinkage functions. It has no extra parameters to learn compared with the original LISTA-based models, yet shows significantly better performance. The main contributions of our paper are as follows:

• The EBT mechanism can be readily incorporated into popular sparse coding DNNs (e.g., LISTA (Gregor & LeCun, 2010) and LISTA with support selection (Chen et al., 2018)) to speed up convergence with no extra parameters.

• We give a rigorous analysis to prove that the estimation errors of EBT-LISTA (i.e., the combination of our EBT and LISTA) and EBT-LISTA with support selection are theoretically lower than those of the original LISTA (Gregor & LeCun, 2010) and LISTA with support selection (Chen et al., 2018), respectively. In addition, the introduced parameters in our EBT are well-disentangled from the reconstruction errors and need only be correlated with the dictionary matrix to ensure convergence. These results guarantee the superiority of our EBT in theory.

• We demonstrate the effectiveness of our EBT on the original LISTA and several of its variants in simulation experiments. We also show that it can be applied to practical applications (e.g., photometric stereo analysis) and achieves superior performance as well.

The organization of this paper is as follows. Section 2 reviews preliminary knowledge for our study. Section 3 introduces a basic form of our EBT and several improved versions. Section 4 provides a theoretical study of the convergence of EBT-LISTA. Experimental results in Section 5 validate the effectiveness of our method in practice. Section 6 concludes the paper.

2. BACKGROUND AND PRELIMINARY KNOWLEDGE

As mentioned in Section 1, ISTA is an iterative algorithm for solving the LASSO in Eq. (1). Its update rule is x^(0) = 0 and

x^(t+1) = sh_{λ/γ}((I − AᵀA/γ)x^(t) + Aᵀy/γ), ∀t ≥ 0, (2)

where sh_b(x) = sign(x)(|x| − b)_+ is a shrinkage function with threshold b ≥ 0 and (·)_+ = max{0, ·}, and γ is a positive constant no smaller than the maximal eigenvalue of the symmetric matrix AᵀA. LISTA keeps the update rule of ISTA but learns its parameters via end-to-end training. Its inference process can be formulated as x^(0) = 0 and

x^(t+1) = sh_{b^(t)}(W^(t)x^(t) + U^(t)y), t = 0, …, d, (3)

where Θ = {W^(t), U^(t), b^(t)}_{t=0,…,d} is the set of learnable parameters and, specifically, b^(t) is the layer-wise threshold, which is learnable but shared among all samples. LISTA achieves lower reconstruction error between its output and the ground truth x_s than ISTA, and it is proved to converge linearly (Chen et al., 2018) when W^(t) = I − U^(t)A holds for every layer t. Thus, Eq. (3) can be written as

x^(t+1) = sh_{b^(t)}((I − U^(t)A)x^(t) + U^(t)y), t = 0, …, d. (4)

Chen et al. (2018) further proposed support selection for LISTA, which introduces sh^p_{(b^(t),p)}(x), whose elements are defined as

(sh^p_{(b,p)}(x))_i = { sign(x_i)(|x_i| − b), if |x_i| > b and i ∉ S_p;  x_i, if |x_i| > b and i ∈ S_p;  0, otherwise }, (5)

to substitute the original shrinkage function sh_{b^(t)}(x), where S_p is the index set of the largest p% elements (in absolute value) of the vector x. Formally, the update rule of LISTA with support selection is x^(0) = 0 and

x^(t+1) = sh^p_{(b^(t),p^(t))}((I − U^(t)A)x^(t) + U^(t)y), t = 0, …, d, (6)

where p^(t) is a hyper-parameter that increases from earlier layers to later layers. LISTA with support selection achieves faster convergence than LISTA (Chen et al., 2018).
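As an illustration, the two shrinkage operators and one ISTA step described above can be sketched in NumPy (a minimal sketch; the function names are ours, and the support-selection rule follows the element-wise definition above):

```python
import numpy as np

def soft_threshold(x, b):
    """Shrinkage sh_b(x) = sign(x) * (|x| - b)_+ applied element-wise."""
    return np.sign(x) * np.maximum(np.abs(x) - b, 0.0)

def soft_threshold_ss(x, b, p):
    """Support-selection shrinkage: entries above the threshold that belong
    to S_p (the largest p% entries in absolute value) bypass the shrinkage;
    other entries above the threshold are soft-thresholded; the rest are 0."""
    k = max(1, int(np.ceil(p / 100.0 * x.size)))
    in_sp = np.zeros(x.size, dtype=bool)
    in_sp[np.argsort(np.abs(x))[-k:]] = True        # membership in S_p
    above = np.abs(x) > b
    out = np.where(above & ~in_sp, soft_threshold(x, b), 0.0)
    out = np.where(above & in_sp, x, out)           # pass S_p entries through
    return out

def ista_step(x, A, y, lam, gamma):
    """One ISTA iteration: sh_{lam/gamma}((I - A^T A/gamma) x + A^T y/gamma)."""
    return soft_threshold(x + A.T @ (y - A @ x) / gamma, lam / gamma)
```

Iterating `ista_step` from `x = 0` reproduces the ISTA recursion; LISTA replaces the fixed matrices and threshold with learned, layer-specific ones.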
Theoretical studies (Chen et al., 2018; Liu et al., 2019) also demonstrate that the threshold of LISTA and its variants should satisfy

b^(t) ← µ(A) sup_{x_s∈S} ‖x^(t) − x_s‖_p (7)

to ensure fast convergence in the noiseless case (i.e., ε = 0), where S is the training set and µ(A) is the generalized mutual coherence coefficient of the dictionary matrix A. Since µ(A) is a crucial term in this paper, we formally give its definition together with that of W(A) as follows.

Definition 1. For A ∈ R^{m×n}, its generalized coherence coefficient is defined as µ(A) = inf_{W∈R^{n×m}, W_{i,:}A_{:,i}=1} max_{i≠j} |W_{i,:}A_{:,j}|, and we say W ∈ W(A) if max_{i≠j} |W_{i,:}A_{:,j}| = µ(A).
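The infimum in Definition 1 is generally hard to compute exactly, but a feasible W (and hence an upper bound on µ(A)) is obtained by taking W_{i,:} = A_{:,i}ᵀ/‖A_{:,i}‖₂², which satisfies the constraint W_{i,:}A_{:,i} = 1. A sketch of this surrogate (our own helper, not part of the paper's method):

```python
import numpy as np

def coherence_surrogate(A):
    """Upper bound on the generalized coherence mu(A) of Definition 1,
    obtained with the feasible choice W_{i,:} = A_{:,i}^T / ||A_{:,i}||_2^2
    (so that W_{i,:} A_{:,i} = 1).  The true mu(A) takes an infimum over
    all feasible W, so this only bounds it from above."""
    W = (A / (A ** 2).sum(axis=0)).T      # rows W_{i,:} = a_i^T / ||a_i||^2
    G = W @ A                             # Gram-like matrix with unit diagonal
    np.fill_diagonal(G, 0.0)              # keep only off-diagonal entries
    return np.abs(G).max()
```

For a matrix with orthonormal columns the surrogate is 0, matching the intuition that coherence measures cross-correlation between distinct atoms.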

3. METHODS

In LISTA and its variants, the threshold b^(t) is commonly treated as a learnable parameter. As demonstrated in Eq. (7), b^(t) should be proportional to the upper bound of the estimation error of the t-th layer in the noiseless case to ensure fast convergence. Thus, outliers or extreme training samples largely influence the value of b^(t), making the obtained threshold unfit for the majority of the data. To be specific, the suggested value for a training set {x_{s_i}}_{i=0,1,…,n} is b^(t) = µ(A) sup_{i=0,1,…,n} ‖x_i^(t) − x_{s_i}‖_p, and normal training of LISTA leads to it in theory (Chen et al., 2018). Yet, if a new training sample x_{s_{n+1}} with a higher reconstruction error is introduced, the expected b^(t) changes to µ(A)‖x_{n+1}^(t) − x_{s_{n+1}}‖_p, which is probably undesirable for the other samples. Similar problems occur if there is large variety in the values of the reconstruction errors. To solve this problem, we propose to disentangle the reconstruction error term from the learnable part of the threshold and introduce adaptive thresholds for LISTA and related networks. We attempt to rewrite the threshold at the t-th layer as

b^(t) = ρ^(t)‖x^(t) − x_s‖_p, (8)

where ρ^(t) is a layer-specific learnable parameter. However, the ground truth x_s is unknown during inference in SLIP, so we need an alternative formulation. Noticing that in the noiseless case Ax^(t) − y = A(x^(t) − x_s), we further rewrite Eq. (8) as

b^(t) = ρ^(t)‖Q(Ax^(t) − y)‖_p, (9)

where Q ∈ R^{n×m} is a compensation matrix introduced to let Eq. (9) approximate Eq. (8) better, i.e., a matrix that makes QA approach the identity matrix is more desirable. Although QA is a low-rank matrix and can never be an identity matrix, we can encourage its diagonal elements to be 1 and its off-diagonal elements to be nearly zero.
This can be directly achieved by letting Q ∈ W(A), where W(A) is defined in Definition 1. According to prior works (Liu et al., 2019; Wu et al., 2020; Chen et al., 2018), we also know that U^(t) ∈ W(A) is required to guarantee linear convergence, thus U^(t) can be a reasonable choice for the matrix Q in our method, making it layer-specific as well. Therefore, our EBT-LISTA is formulated as x^(0) = 0 and

x^(t+1) = sh_{b^(t)}((I − U^(t)A)x^(t) + U^(t)y), t = 0, …, d, with b^(t) = ρ^(t)‖U^(t)(Ax^(t) − y)‖_p. (10)

Note that only ρ^(t) and U^(t) are learnable in the above formulation, so our EBT-LISTA introduces no extra parameters compared with the original LISTA. The architectures of LISTA and EBT-LISTA are shown in Figure 1. We can also apply our EBT mechanism to LISTA with support selection (Chen et al., 2018). It is straightforward to keep the support selection operation and replace the fixed threshold with our EBT; such a combination can be similarly formulated as x^(0) = 0 and

x^(t+1) = sh^p_{(b^(t),p^(t))}((I − U^(t)A)x^(t) + U^(t)y), t = 0, …, d, with b^(t) = ρ^(t)‖U^(t)(Ax^(t) − y)‖_p. (11)

Our former analysis is based on the noiseless case. In the noisy case, there is A(x^(t) − x_s) = Ax^(t) − y + ε. Since the noise is generally unknown in practical problems, we may add an extra learnable parameter to the threshold to compensate for it, i.e.,

b^(t) = ρ^(t)‖U^(t)(Ax^(t) − y)‖_p + α^(t), (12)

where α^(t) is a learnable parameter accounting for the observation noise.
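To make the data flow of Eq. (10) concrete, here is an untrained forward-pass sketch in NumPy. The learned U^(t) is replaced by the ISTA-style choice Aᵀ/γ and ρ^(t) by user-supplied constants; both are placeholders for the trained parameters, so this illustrates only the computation pattern, not the trained model:

```python
import numpy as np

def ebt_lista_forward(y, A, rhos, num_layers):
    """Untrained forward pass of EBT-LISTA, Eq. (10).  In the trained model
    U^(t) and rho^(t) are learned; here U^(t) = A^T / gamma (the ISTA choice)
    and rho^(t) = rhos[t] serve as stand-ins to show the data flow."""
    gamma = np.linalg.norm(A, 2) ** 2            # >= largest eigenvalue of A^T A
    x = np.zeros(A.shape[1])
    for t in range(num_layers):
        U = A.T / gamma                          # placeholder for the learned U^(t)
        b = rhos[t] * np.linalg.norm(U @ (A @ x - y), 1)   # error-based threshold
        z = x - U @ (A @ x - y)                  # (I - U A) x + U y
        x = np.sign(z) * np.maximum(np.abs(z) - b, 0.0)    # sh_b(z)
    return x
```

Note how the threshold is recomputed per layer and per observation from the current residual Ax^(t) − y, shrinking automatically as the estimate improves; only ρ^(t) is left to learn.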

4. THEORETICAL ANALYSIS

In this section, we provide convergence analyses for LISTA and LISTA with support selection. We focus on the noiseless case, and the main results are obtained under a mild assumption on the ground-truth sparse code. To be more specific, we assume that the ground-truth sparse vector x_s is sampled from the distribution γ(B, s), i.e., the number of its non-zero elements follows a uniform distribution U(0, s) and the magnitude of each non-zero element follows an arbitrary distribution on [−B, B]. Compared with the assumption made in some recent work (Chen et al., 2018; Liu et al., 2019; Wu et al., 2020), i.e., X(B, s) = {x_s | ‖x_s‖_0 ≤ s, ‖x_s‖_∞ ≤ B}, our assumption provides a more detailed yet also stricter description of the distribution of x_s, especially of its sparsity. In fact, our assumption can easily be rewritten in the form of the prior ones: for any x_s ∼ γ(B, s), there is x_s ∈ X(B, s) := {x_s | ‖x_s‖_0 ≤ s, ‖x_s‖_∞ ≤ B} (Chen et al., 2018; Wu et al., 2020). We will mention the condition that s is sufficiently small for error-based thresholding, which specifically means µ(A)s ≪ 1.
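For intuition, sampling from γ(B, s) can be sketched as follows. We take the uniform distribution on [−B, B] for the non-zero magnitudes and treat U(0, s) as uniform over the integers {0, …, s}; both are our concretizations of the assumption, which leaves the magnitude distribution arbitrary:

```python
import numpy as np

def sample_ground_truth(n, s, B, rng):
    """Draw x_s ~ gamma(B, s): the number of non-zeros is uniform on
    {0, ..., s} and each non-zero magnitude lies in [-B, B] (uniform here,
    for concreteness)."""
    x = np.zeros(n)
    k = rng.integers(0, s + 1)                     # |supp(x_s)| ~ U(0, s)
    support = rng.choice(n, size=k, replace=False) # random support positions
    x[support] = rng.uniform(-B, B, size=k)        # bounded magnitudes
    return x
```

Every draw from this sampler automatically satisfies x_s ∈ X(B, s), consistent with the containment noted above.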

4.1. ERROR-BASED THRESHOLDING ON LISTA

Let us first discuss the convergence of LISTA and how our EBT accelerates it. Proofs of all our theoretical results can be found in Appendix A.2. To get started, we recall the theoretical guarantee on the convergence of LISTA given in the work of Chen et al. (2018).

Lemma 1 (Chen et al., 2018). For LISTA formulated as in Eq. (4), if x_s is sampled from γ(B, s) and s is small such that µ(A)(2s − 1) < 1, with U^(t) ∈ W(A), and assuming b^(t) = µ(A) sup_{x_s} ‖x^(t) − x_s‖_1 is achieved to guarantee the "no false positive" property (i.e., supp(x^(t)) ⊂ supp(x_s)), then the estimation x^(t) at the t-th layer of LISTA satisfies

‖x^(t) − x_s‖_2 ≤ sB exp(c_0 t), where c_0 = log((2s − 1)µ(A)) < 0.

The above lemma shows that the reconstruction error of LISTA decreases at a rate of c_0, with an error bound also related to s and B. With Lemma 1, we further show in the following theorem that the convergence of EBT-LISTA is similarly bounded from above.

Theorem 1 (Convergence of EBT-LISTA). For EBT-LISTA formulated as in Eq. (10) with p = 1, if x_s is sampled from γ(B, s) and s is sufficiently small, with U^(t) ∈ W(A), and assuming ρ^(t) = µ(A)/(1 − µ(A)s) is achieved to guarantee the "no false positive" property, then the estimation x^(t) at the t-th layer satisfies

‖x^(t) − x_s‖_2 ≤ q_0 exp(c_1 t), where q_0 < sB and c_1 < c_0 hold with probability 1 − µ(A)s.

From Theorem 1 we know that EBT-LISTA converges similarly, with a rate of c_1 that is probably faster than that of the original LISTA. In addition, its reconstruction error at each layer t is also probably lower than that of the original LISTA, since q_0 < sB. As µ(A) is generally small in practice and s is assumed to be sufficiently small such that µ(A)s ≪ 1, the probability of achieving this superiority is very high in theory.
We will further show in experiments that even if s is not that small (i.e., the sparsity is not that high), our EBT still achieves favorable improvement on the basis of LISTA. Moreover, unlike the desired threshold in the original LISTA (i.e., µ(A) sup_{x_s} ‖x^(t) − x_s‖_1), which depends on specific training samples, the desired threshold in EBT-LISTA is disentangled from the reconstruction error.

4.2. ERROR-BASED THRESHOLDING ON LISTA WITH SUPPORT SELECTION

Effective prior work by Chen et al. (2018) improved the performance of LISTA with the support selection operation, which achieves a lower error bound than the original LISTA. We first give a detailed discussion of the convergence of LISTA with support selection before delving into the theoretical study of how combining our EBT with it further improves performance.

Lemma 2 (Convergence of LISTA with support selection). For LISTA with support selection as formulated in Eq. (6), if x_s is sampled from γ(B, s) and s is sufficiently small, with U^(t) ∈ W(A), and assuming b^(t) = µ(A) sup_{x_s} ‖x^(t) − x_s‖_1 is achieved and p^(t) is sufficiently large, there exist two convergence phases. In the first phase, i.e., t ≤ t_0, the t-th layer estimation x^(t) satisfies ‖x^(t) − x_s‖_2 ≤ sB exp(c_2 t), where c_2 ≤ c_0 = log((2s − 1)µ(A)). In the second phase, i.e., t > t_0, the estimation satisfies ‖x^(t) − x_s‖_2 ≤ C‖x^(t−1) − x_s‖_2, where C ≤ sµ(A).

Lemma 2 shows that, when powered with support selection, LISTA exhibits two convergence phases: the earlier phase is generally slower and the later phase is faster. On the basis of Lemma 2, we further show in the following theorem that the convergence of EBT-LISTA with support selection is similarly bounded and also has two phases.

Theorem 2 (Convergence of EBT-LISTA with support selection). For EBT-LISTA with support selection and p = 1, if x_s is sampled from γ(B, s) and s is sufficiently small, with U^(t) ∈ W(A), and assuming ρ^(t) = µ(A)/(1 − µ(A)s) is achieved and p^(t) is sufficiently large, there exist two convergence phases. In the first phase, i.e., t ≤ t_1, the t-th layer estimation x^(t) satisfies ‖x^(t) − x_s‖_2 ≤ q_1 exp(c_3 t), where c_3 < c_2, q_1 < sB, and t_1 < t_0 hold with probability 1 − µ(A)s. In the second phase, i.e., t > t_1, the estimation satisfies ‖x^(t) − x_s‖_2 ≤ C‖x^(t−1) − x_s‖_2, where C ≤ sµ(A).
The above theorem shows that, further incorporated with EBT, the model also exhibits two convergence phases. It possesses the same convergence rate in the second phase as the original LISTA with support selection in Lemma 2, while in the first phase our EBT leads to faster convergence and enters the second phase earlier, which shows the effectiveness of our EBT for LISTA with support selection in theory.

5. EXPERIMENTS

We conduct extensive experiments on both synthetic and real data to validate our theorems and verify the effectiveness of our methods. The network architectures and training strategies in our experiments mostly follow those of prior works (Chen et al., 2018; Wu et al., 2020). To be more specific, all compared networks have d = 16 layers, and their learnable parameters W^(t), U^(t), and ρ^(t) (or b^(t) without our EBT) are not shared between layers. The training batch size is 64, and we use the popular Adam optimizer (Kingma & Ba, 2014) with its default hyper-parameters β_1 = 0.9 and β_2 = 0.999. Training is performed layer by layer in a progressive way, i.e., if the validation loss of the current layer does not decrease for 4000 iterations, training moves on to the next layer. When training each layer, the learning rate is initialized to 0.0005 and is then decreased to 0.0001 and finally to 0.00001 whenever the validation loss does not decrease for 4000 iterations. In our proposed methods, we impose the constraint between W^(t) and U^(t) so that W^(t) = I − U^(t)A holds for all t, i.e., the coupled constraints are introduced for all evaluated models. For models empowered with support selection, we append "SS" to their names for clarity, e.g., LISTA with support selection is renamed "LISTA-SS" in the following discussions. The value of ρ^(t) is initialized to 0.02.

5.1. SIMULATION EXPERIMENTS

Basic settings. In simulation experiments, we set m = 250 and n = 500, and we generate the dictionary matrix A from the standard Gaussian distribution. The indices of the non-zero entries in x_s are determined by a Bernoulli distribution with sparsity p_b (i.e., the probability of any entry being zero), while the magnitudes of the non-zero entries are also sampled from the standard Gaussian distribution. The noise ε is sampled from a Gaussian distribution whose standard deviation is determined by the noise level. With y = Ax_s + ε, we can randomly synthesize in-stream x_s and obtain a corresponding set of observations for training. Similarly, we synthesize two sets for validation and test, respectively, each containing 1000 samples. The sparse coding performance of different models is evaluated by the normalized mean squared error (NMSE) in decibels (dB):

NMSE(x, x_s) = 10 log_10(‖x − x_s‖_2² / ‖x_s‖_2²). (13)

Disentanglement. First, we compare the parameters learned for the thresholds in LISTA and our EBT-LISTA. In addition to the l_1 norm (i.e., b^(t) = ρ^(t)‖U^(t)(Ax^(t) − y)‖_1) concerned in the theorem, we also test EBT-LISTA-SS with the l_2 norm (i.e., letting b^(t) = ρ^(t)‖U^(t)(Ax^(t) − y)‖_2). It can be seen that, with both the l_1 and l_2 norms, EBT-LISTA-SS leads to consistently faster convergence than LISTA. Also, there clearly exist two convergence phases for EBT-LISTA-SS and LISTA-SS, and the later phase is indeed faster. With its faster convergence, EBT-LISTA-SS finally achieves superior performance. The experiment is performed in the noiseless case with p_b = 0.95. Similar observations can be made with other variants of LISTA (e.g., ALISTA, see Figure 4(b)).

Adaptivity to unknown sparsity.
As mentioned, in some practical scenarios there may exist a gap between the training and test data distributions, or we may not know the distribution of real test data and will have to train on data synthesized based on a guess of the test distribution. Under such circumstances, it is important to consider the adaptivity/generalization of a sparse coding model (trained on a specific data distribution or with a specific sparsity) to test data sampled from different distributions with different sparsity. We conduct an experiment in such a scenario, in which the test sparsity differs from the training sparsity. Figure 5 shows the results in three different settings. The black curves represent the optimal model, where LISTA is trained on exactly the same sparsity as that of the test data. It can be seen that our EBT has huge advantages in such a practical scenario where adaptivity to un-trained sparsity is required, and the performance gain over LISTA is larger when the distribution shift between training and test is larger (cf. the purple and yellow lines in Figure 5). We use "Origin" to indicate the optimal scenario where the LISTA models are trained on exactly the same sparsity as that of the test data.

Comparison with competitors. Here we compare EBT-LISTA-SS, EBT-LISTA, and EBT-ALISTA with other methods comprehensively. In addition to LISTA-SS, LISTA, and ALISTA, we also compare with learned AMP (LAMP) (Borgerding et al., 2017). The performance of the different networks under different noise levels is shown in Figure 6. It can be seen that, when combined with LISTA and its variants, our EBT achieves better or similar performance. Figure 6(a) demonstrates that the combination of ALISTA and EBT performs best in the noiseless case (i.e., SNR = ∞), yet it is inferior to the other networks when noise is present.
The figures also show that the performance of our EBT is more promising in the noiseless or low-noise cases (i.e., SNR = ∞ and SNR = 40dB), while in very noisy scenarios it provides little help. We further test the different networks under different sparsity levels and condition numbers. Note that for the methods with support selection (i.e., EBT-ALISTA, EBT-LISTA-SS, ALISTA, and LISTA-SS), p and p_max are set as 0.6 and 6.5 when p_b = 0.95, and as 1.2 and 13.0 when p_b = 0.9. Figure 7 shows some of the results in different settings, and more results can be found in Appendix A.1. In all settings, we can see that our EBT leads to significantly faster convergence. In addition, the superiority of our EBT-based models is more significant with a larger p_b, for which the assumption of a sufficiently small s is more likely to hold (compare Figures 7(a), 9(a), and 9(b)). We also tried training with p_b = 0.99, yet some classical models failed to converge in that setting, so the results are not shown here.
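For completeness, the NMSE metric used throughout these simulations can be sketched in one line (we normalize by ‖x_s‖_2², the usual convention for this metric):

```python
import numpy as np

def nmse_db(x_hat, x_s):
    """Normalized MSE in decibels: 10 * log10(||x_hat - x_s||_2^2 / ||x_s||_2^2).
    More negative values mean better reconstruction."""
    return 10.0 * np.log10(np.sum((x_hat - x_s) ** 2) / np.sum(x_s ** 2))
```

For instance, an estimate whose error norm is 10% of the signal norm scores −20 dB.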

5.2. PHOTOMETRIC STEREO ANALYSIS

We also consider a practical sparse coding task: photometric stereo analysis (Ikehata et al., 2012), which estimates the normal direction of a Lambertian surface given q observations under different light directions. It can be formulated as

o = ρLn + e, (14)

where o ∈ R^q is the observation, n ∈ R^3 represents the normal direction of the Lambertian surface to be estimated, L ∈ R^{q×3} represents the normalized light directions, e is the noise vector, and ρ is the diffuse albedo scalar. Although the normal vector n is unconstrained in Eq. (14), the noise vector e is found to be generally sparse (Wu et al., 2010; Ikehata et al., 2012). Therefore, we may estimate the noise e first. We introduce the orthogonal complement of L, denoted by L†, to rewrite Eq. (14) as

L†o = ρL†Ln + L†e = L†e. (15)

On the basis of the above equation, estimating e is basically a sparse coding problem in the noiseless case, where L† plays the role of the dictionary matrix A, e is the sparse code x_s to be estimated, and L†o is the observation y. Once we have achieved a reasonable estimation of e, we can further obtain n from o − e = ρLn, e.g., in the least-squares sense. In this experiment, we mainly follow the settings in the work of Xin et al. (2016) and Wu et al. (2020). We use the same bunny picture for evaluation, and L is also randomly selected from the hemispherical surface. We set the number of observations q to 15, 25, and 35 and let the training sparsity of e be p_t = 0.8. The final performance is evaluated by the average angle (in degrees) between the estimated and ground-truth normal vectors. Since the distribution of the noise is generally unknown in practice, adaptivity is of great importance in this task. We use two test settings for evaluating the different models, in which the sparsity of the noise in the test data (i.e., p_e) is set as 0.8 and 0.9.
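A sketch of this reduction follows. The orthogonal-complement basis is computed via SVD, and the normal is recovered by least squares once e is known; the function names and this particular recovery step are our own illustration, assuming L has full column rank and ρ > 0:

```python
import numpy as np

def sparse_error_problem(L, o):
    """Rewrite o = rho * L n + e as a sparse coding problem: the rows of
    L_perp span the orthogonal complement of the column space of L, so
    L_perp @ L = 0 and L_perp @ o = L_perp @ e (cf. Eq. (15))."""
    U, _, _ = np.linalg.svd(L, full_matrices=True)
    L_perp = U[:, L.shape[1]:].T        # (q - 3) x q for L in R^{q x 3}
    return L_perp, L_perp @ o

def recover_normal(L, o, e_hat):
    """Given an estimate of e, solve o - e = rho * L n by least squares
    and normalize to obtain the unit normal direction (rho > 0 assumed)."""
    v, *_ = np.linalg.lstsq(L, o - e_hat, rcond=None)
    return v / np.linalg.norm(v)
```

In the experiment, the estimate of e would come from a sparse coding network (e.g., EBT-LISTA-SS) applied to the pair returned by `sparse_error_problem`.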
We compare EBT-LISTA-SS, LISTA-SS, and two conventional methods, least squares (denoted l_s) and least 1-norm (denoted l_1), in Table 1. We also build the 3D reconstruction error maps for LISTA-SS and EBT-LISTA-SS, shown in Figure 8 in the appendix. The results show that EBT-LISTA-SS outperforms all competitors in all concerned settings; notably, the advantage is remarkable when p_e = 0.9, which means our EBT-based network has better adaptivity and can be more effective in this practical task.

Table 1: Mean error (in degrees) with different numbers of observations and different test sparsity

p_e    q     l_s      l_1      LISTA-SS       EBT-LISTA-SS
0.8    15    3.41     0.678    5.50 × 10^-2   4.09 × 10^-2
0.8    25    3.05     0.408    7.48 × 10^-3   3.17 × 10^-3
0.8    35    2.78     0.336    1.89 × 10^-3   5.95 × 10^-4
0.9    15    1.94     0.232    6.67 × 10^-3   2.57 × 10^-3
0.9    25    0.145    2.03     1.33 × 10^-3   1.64 × 10^-4
0.9    35    1.61     0.088    2.93 × 10^-4   4.91 × 10^-5

6. CONCLUSION

In this paper, we have studied the thresholds in the shrinkage functions of LISTA. We have proposed a novel EBT mechanism that well-disentangles the learnable parameter in the shrinkage function on each layer of LISTA from its layer-wise reconstruction error. We have proved theoretically that, in combination with LISTA and its variants, our EBT mechanism leads to faster convergence and superior final sparse coding performance. We have also shown that the EBT mechanism endows deep unfolding models with higher adaptivity to observations with a variety of sparsity levels. Our experiments on both synthetic and real data have verified the effectiveness of our EBT, especially when the distribution of the test data differs from that of the training data. We hope to extend our EBT mechanism to more complex tasks in future work.

A APPENDIX

A.1 ADDITIONAL SIMULATION RESULTS

Comparison with competitors. More experimental results than those in Figure 7 are given here. Figure 9 shows the results of our EBT-based methods (i.e., EBT-LISTA, EBT-LISTA-SS, and EBT-ALISTA) and other competitors in settings including p_b = 0.9 and p_b = 0.8 (i.e., sparsity 0.9 and 0.8), and with the condition number set as 3 (with p_b = 0.9). For the concerned methods with support selection (EBT-ALISTA, EBT-LISTA-SS, ALISTA, and LISTA-SS), p and p_max are set as 1.2 and 13 when p_b = 0.9, and as 1.5 and 16.25 when p_b = 0.8. From Figure 9, we can see that our EBT leads to better performance, consistent with the results in the main paper.

EBT mechanism on (F)ISTA. We further apply our proposed EBT mechanism to standard ISTA and FISTA. Note that ISTA and FISTA do not converge linearly. We set the scalar γ (in Eq. (2)) as a constant and the regularization coefficient λ (in Eqs. (1) and (2)) as 0.1 and 0.2. In our proposed methods, we use λ‖Ax^(t) − y‖_1/γ as the threshold, compared with λ/γ in (F)ISTA. From the experimental results shown in Figure 11, we find that our EBT mechanism leads to faster convergence. However, the faster convergence has a negative influence on the final performance. That is because the convergence of ISTA and FISTA is sub-linear, while our EBT mechanism is designed for linearly convergent methods. Therefore, EBT-(F)ISTA might converge too fast to reach better final performance.
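The EBT-(F)ISTA variant described above replaces the fixed threshold λ/γ with λ‖Ax^(t) − y‖_1/γ. A minimal ISTA version of this idea (our sketch; a FISTA version would add the usual momentum step):

```python
import numpy as np

def ebt_ista(y, A, lam, num_iters):
    """ISTA with the error-based threshold lam * ||A x - y||_1 / gamma in
    place of the fixed lam / gamma, as in the EBT-(F)ISTA variant above."""
    gamma = np.linalg.norm(A, 2) ** 2            # >= largest eigenvalue of A^T A
    x = np.zeros(A.shape[1])
    for _ in range(num_iters):
        z = x + A.T @ (y - A @ x) / gamma        # gradient step on the data term
        b = lam * np.linalg.norm(A @ x - y, 1) / gamma   # shrinks with the residual
        x = np.sign(z) * np.maximum(np.abs(z) - b, 0.0)  # sh_b(z)
    return x
```

Because the threshold vanishes as the residual vanishes, the iteration behaves increasingly like plain gradient descent near a solution, which matches the observation that EBT-(F)ISTA converges quickly but may not settle at the best final estimate.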

A.2 PROOF OF THEOREMS

We first introduce some notation before delving into the proofs. The support set of a vector x is defined as the index set of its non-zero values and is written as supp(x). We define S as the support set of the vector x_s, and we let |S| denote the number of elements of S. We denote by (x)_i the i-th element of the vector x and by (M)_{ij} the element in the i-th row and j-th column of the matrix M.

A.2.1 PROOF OF THEOREM 1

Assume i ∉ S, i.e., (x_s)_i = 0. If (x^(t+1))_i ≠ 0, then the input to the shrinkage function exceeds the threshold; noting that y = Ax_s, there is

b^(t) = ρ^(t)‖U^(t)(Ax^(t) − y)‖_1 < |[(I − U^(t)A)x^(t) + U^(t)Ax_s]_i|
= |[(I − U^(t)A)(x^(t) − x_s) + x_s]_i|
≤ |[(I − U^(t)A)(x^(t) − x_s)]_i| + |(x_s)_i|
= |[(I − U^(t)A)(x^(t) − x_s)]_i|
= |Σ_j (I − U^(t)A)_{ij}(x^(t) − x_s)_j|
≤ Σ_j |(I − U^(t)A)_{ij}||(x^(t) − x_s)_j|
≤ Σ_j µ(A)|(x^(t) − x_s)_j|
≤ µ(A)‖x^(t) − x_s‖_1. (16)

On the other hand, since |[(I − U^(t)A)(x^(t) − x_s)]_i| ≤ µ(A)‖x^(t) − x_s‖_1, we have ‖(I − U^(t)A)(x^(t) − x_s)‖_1 ≤ |S|µ(A)‖x^(t) − x_s‖_1. Since ‖U^(t)A(x^(t) − x_s)‖_1 = ‖(x^(t) − x_s) − (I − U^(t)A)(x^(t) − x_s)‖_1, we have

(1 − |S|µ(A))‖x^(t) − x_s‖_1 ≤ ‖U^(t)A(x^(t) − x_s)‖_1 ≤ (1 + |S|µ(A))‖x^(t) − x_s‖_1. (17)

As ρ^(t) = µ(A)/(1 − µ(A)s) ≥ µ(A)/(1 − µ(A)|S|), there is

ρ^(t)‖U^(t)A(x^(t) − x_s)‖_1 ≥ µ(A)‖x^(t) − x_s‖_1. (18)

Eqs. (16) and (18) conflict, which means that (x^(t+1))_i = 0 whenever (x_s)_i = 0 (i.e., supp(x^(t+1)) ⊂ S): our EBT-LISTA also enjoys the "no false positive" property. From Eq. (10), we have

x^(t+1) − x_s = sh_{b^(t)}((I − U^(t)A)x^(t) + U^(t)y) − x_s
= (I − U^(t)A)x^(t) + U^(t)Ax_s − x_s − b^(t)h(x^(t+1))
= (I − U^(t)A)(x^(t) − x_s) − b^(t)h(x^(t+1)), (19)

where h(·) is applied element-wise with h(x) = 1 if x > 0, h(x) = −1 if x < 0, and h(x) ∈ [−1, 1] if x = 0. For the i-th element of x^(t+1) − x_s, we have

|(x^(t+1) − x_s)_i| ≤ |[(I − U^(t)A)(x^(t) − x_s)]_i| + |b^(t)|. (20)

Since supp(x^(t+1)) ⊂ S, we have ‖x^(t+1) − x_s‖_1 = Σ_{i∈S} |(x^(t+1) − x_s)_i|. Thus

‖x^(t+1) − x_s‖_1 ≤ Σ_{i∈S} (|[(I − U^(t)A)(x^(t) − x_s)]_i| + |b^(t)|)
= Σ_{i∈S} (|Σ_{j∈S} (I − U^(t)A)_{ij}(x^(t) − x_s)_j| + |b^(t)|)
≤ Σ_{i∈S} Σ_{j∈S, j≠i} |(I − U^(t)A)_{ij}||(x^(t) − x_s)_j| + |S||b^(t)|
≤ (|S| − 1)µ(A)‖x^(t) − x_s‖_1 + |S|ρ^(t)‖U^(t)A(x^(t) − x_s)‖_1
= (|S| − 1)µ(A)‖x^(t) − x_s‖_1 + (µ(A)|S|/(1 − µ(A)s))‖U^(t)A(x^(t) − x_s)‖_1
≤ (|S| + |S|(1 + µ(A)s)/(1 − µ(A)s) − 1)µ(A)‖x^(t) − x_s‖_1. (21)

The final step holds because |S| ≤ s and Eq. (17) holds.
The l_2 error bound of the t-th output of EBT-LISTA can then be calculated as

‖x^(t) − x_s‖_2 ≤ ‖x^(t) − x_s‖_1 ≤ ((|S| + |S|(1 + µ(A)s)/(1 − µ(A)s) − 1)µ(A))^t ‖x^(0) − x_s‖_1 ≤ q_0 exp(c_1 t), (22)

where q_0 = ‖x_s‖_1 and c_1 = log((|S| + |S|(1 + µ(A)s)/(1 − µ(A)s) − 1)µ(A)). Comparing c_1 with c_0, we have

exp(c_0) − exp(c_1) = 2µ(A)(s − |S|/(1 − µ(A)s)) > 0 (23)

whenever |S| < s(1 − µ(A)s). Under this circumstance, we also have

q_0 = ‖x_s‖_1 ≤ |S|B < s(1 − µ(A)s)B ≤ sB. (24)

Since x_s is sampled from γ(B, s), Eqs. (23) and (24) hold with probability 1 − η, where

η = (s − s(1 − µ(A)s))/s = µ(A)s. (25)

A.2.2 PROOF OF LEMMA 2

For LISTA with support selection as formulated in Eq. (6), there is

x^(t+1) − x_s = sh^p_{(b^(t),p^(t))}((I − U^(t)A)x^(t) + U^(t)y) − x_s
= (I − U^(t)A)x^(t) + U^(t)Ax_s − x_s − b^(t)g(x^(t+1))
= (I − U^(t)A)(x^(t) − x_s) − b^(t)g(x^(t+1)), (26)

where

(g(x))_i = { 0, if i ∈ S_p and x_i ≠ 0;  1, if i ∉ S_p and x_i > 0;  −1, if i ∉ S_p and x_i < 0;  ∈ [−1, 1], if x_i = 0 }. (27)

For the i-th element of x^(t+1) − x_s, we have

|(x^(t+1) − x_s)_i| ≤ |[(I − U^(t)A)(x^(t) − x_s)]_i| + |b^(t)g(x_i^(t+1))|. (28)

Since b^(t) = µ(A) sup_{x_s} ‖x^(t) − x_s‖_1, as with standard LISTA, LISTA with support selection also enjoys the "no false positive" property. Therefore, supp(x^(t+1)) ⊂ S and ‖x^(t+1) − x_s‖_1 = Σ_{i∈S} |(x^(t+1) − x_s)_i|. Similar to Eq. (21), we have

‖x^(t+1) − x_s‖_1 ≤ Σ_{i∈S} (|[(I − U^(t)A)(x^(t) − x_s)]_i| + |b^(t)g(x_i^(t+1))|)
= Σ_{i∈S} (|Σ_{j∈S} (I − U^(t)A)_{ij}(x^(t) − x_s)_j| + |b^(t)g(x_i^(t+1))|)
≤ Σ_{i∈S} Σ_{j∈S, j≠i} |(I − U^(t)A)_{ij}||(x^(t) − x_s)_j| + Σ_{i∈S} |b^(t)g(x_i^(t+1))|
≤ (|S| − 1)µ(A)‖x^(t) − x_s‖_1 + Σ_{i∈S} |b^(t)g(x_i^(t+1))|. (29)

From Eq. (27), we have |g(x_i^(t+1))| ≤ 1, and g(x_i^(t+1)) = 0 only if i ∈ S_p and x_i^(t+1) ≠ 0. Let S_{t+1} denote the number of non-zero entries in x^(t+1), and let P_{t+1} denote the number of the largest p^(t+1)% elements (in absolute value) of x^(t+1). The number of zero entries in g(x^(t+1)) is then min(S_{t+1}, P_{t+1}).
Then Eq. (29) can be calculated as
$$\|x^{(t+1)} - x_s\|_1 \le (|S| - 1)\mu(A)\|x^{(t)} - x_s\|_1 + \sum_{i \in S}|b^{(t)}g(x^{(t+1)}_i)| \le (|S| - 1)\mu(A)\|x^{(t)} - x_s\|_1 + \big(|S| - \min(S_{t+1}, P_{t+1})\big)|b^{(t)}| = (|S| - 1)\mu(A)\|x^{(t)} - x_s\|_1 + \big(|S| - \min(S_{t+1}, P_{t+1})\big)\mu(A)\sup_{x_s}\|x^{(t)} - x_s\|_1. \tag{30}$$
Taking the supremum of Eq. (30), there is
$$\sup_{x_s}\|x^{(t+1)} - x_s\|_1 \le \big(2s - 1 - \min(S_{t+1}, P_{t+1})\big)\mu(A)\sup_{x_s}\|x^{(t)} - x_s\|_1. \tag{31}$$
Note that $\|x_s\|_1 \le sB$; letting $k = \arg\min_t \min(S_t, P_t)$, the $\ell_2$ upper bound of the $t$-th output can be calculated as
$$\|x^{(t)} - x_s\|_2 \le \|x^{(t)} - x_s\|_1 \le \sup_{x_s}\|x^{(t)} - x_s\|_1 \le \prod_{i=1}^{t}\big(2s - 1 - \min(S_i, P_i)\big)\mu(A)\sup_{x_s}\|x^{(0)} - x_s\|_1 \le \Big(\big(2s - 1 - \min(S_k, P_k)\big)\mu(A)\Big)^t sB \le sB\exp(c_2 t), \tag{32}$$
where $c_2 = \log\big((2s - 1 - \min(S_k, P_k))\mu(A)\big)$. Apparently, we have $c_2 \le c_0 = \log((2s - 1)\mu(A))$.

From Eq. (32), we have $\|x^{(t)} - x_s\|_1 \le sB\exp(c_2 t)$, which means the $\ell_1$ error bound can approach 0. Thus there exists a $t^*$ such that, when $t > t^*$, $\|x^{(t)} - x_s\|_1 \le \min_{i \in S}|(x_s)_i|$. Note that $|x^{(t)}_i - (x_s)_i| \le \|x^{(t)} - x_s\|_1$. If $i \in S$, i.e., $(x_s)_i \neq 0$, then $x^{(t)}_i \neq 0$, which means $S \subset \mathrm{supp}(x^{(t)})$. Recalling the "no false positive" property, i.e., $\mathrm{supp}(x^{(t)}) \subset S$, we can conclude that $\mathrm{supp}(x^{(t)}) = S$. Recall that $P_t$ increases layer-wise and $p^{(t)}$ is sufficiently large, so there exists a $t'$ satisfying $P_{t'} > s$. Let $t_0 = \max(t^*, t')$; if $t > t_0$, then $P_t \ge |S|$ and $\mathrm{supp}(x^{(t)}) = S$. Under this circumstance, if $i \in S$, we have $x^{(t)}_i \neq 0$ and $i \in S_{p_t}$, which means every element in $S$ will be selected as support. Therefore, we have
$$x^{(t+1)}_i - (x_s)_i = \mathrm{sh}^p_{(b^{(t)}, p^{(t)})}\Big(\big((I - U^{(t)}A)x^{(t)} + U^{(t)}A x_s\big)_i\Big) - (x_s)_i = \big((I - U^{(t)}A)x^{(t)} + U^{(t)}A x_s\big)_i - (x_s)_i = \big((I - U^{(t)}A)(x^{(t)} - x_s)\big)_i. \tag{33}$$
We let $x_S \in \mathbb{R}^{|S|}$ denote the vector that keeps the elements of $x$ with indices in $S$ and removes the others. Similarly, we let $M(S, S) \in \mathbb{R}^{|S| \times |S|}$ denote the submatrix of a matrix $M$ that keeps a row and column if the index belongs to $S$.
Then, we have
$$\|x^{(t+1)} - x_s\|_2 = \|(x^{(t+1)} - x_s)_S\|_2 = \big\|\big((I - U^{(t)}A)(x^{(t)} - x_s)\big)_S\big\|_2 = \big\|(I - U^{(t)}A)(S, S)\,(x^{(t)} - x_s)_S\big\|_2 \le \big\|(I - U^{(t)}A)(S, S)\big\|_2\,\big\|(x^{(t)} - x_s)_S\big\|_2 = C\,\|x^{(t)} - x_s\|_2, \tag{34}$$
where $C = \|(I - U^{(t)}A)(S, S)\|_2$. Further, we have $C \le \|(I - U^{(t)}A)(S, S)\|_F \le \sqrt{|S|^2\mu(A)^2} = |S|\mu(A) \le s\mu(A)$.
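The final Frobenius-norm bound can be checked numerically for the illustrative choice $U^{(t)} = A^\top$ (an assumption for this snippet only; the learned $U^{(t)}$ need not equal $A^\top$). With unit-norm columns, $I - A^\top A$ has zero diagonal and off-diagonal entries bounded by $\mu(A)$, so the submatrix on any support $S$ has Frobenius norm at most $\sqrt{|S|(|S|-1)}\,\mu(A) \le |S|\mu(A)$:

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, s = 16, 32, 4
A = rng.standard_normal((m, n))
A /= np.linalg.norm(A, axis=0)            # unit-norm columns
G = A.T @ A
mu = np.max(np.abs(G - np.eye(n)))        # mutual coherence mu(A)
S = rng.choice(n, size=s, replace=False)  # a support with |S| = s
M = (np.eye(n) - G)[np.ix_(S, S)]         # (I - U A)(S, S) with U = A^T
frob = np.linalg.norm(M, 'fro')
assert frob <= np.sqrt(s * (s - 1)) * mu <= s * mu
```

The assertion holds deterministically here because every diagonal entry of $M$ is exactly zero and every off-diagonal entry is bounded by $\mu(A)$ by construction.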

A.2.3 PROOF OF THEOREM 2

For the EBT-LISTA with support selection formulated in Eq. (11), similar to Eqs. (26) and (28), we have
$$|(x^{(t+1)} - x_s)_i| = \big|[(I - U^{(t)}A)(x^{(t)} - x_s)]_i - b^{(t)}g(x^{(t+1)}_i)\big| \le \big|[(I - U^{(t)}A)(x^{(t)} - x_s)]_i\big| + |b^{(t)}g(x^{(t+1)}_i)|, \tag{35}$$
where $b^{(t)} = \rho^{(t)}\|U^{(t)}(A x^{(t)} - y)\|_1 = \frac{\mu(A)}{1 - \mu(A)s}\|U^{(t)}A(x^{(t)} - x_s)\|_1$. As with the original EBT-LISTA, EBT-LISTA with support selection is also "no false positive", and Eq. (17) holds as well. Therefore,
$$\|U^{(t)}A(x^{(t)} - x_s)\|_1 \le (1 + |S|\mu(A))\|x^{(t)} - x_s\|_1 \le (1 + s\mu(A))\|x^{(t)} - x_s\|_1.$$
Similar to Eqs. (29) and (30), we obtain
$$\|x^{(t+1)} - x_s\|_1 \le \Big(\frac{2}{1 - \mu(A)s}|S| - \frac{1 + \mu(A)s}{1 - \mu(A)s}\min(S_{t+1}, P_{t+1}) - 1\Big)\mu(A)\|x^{(t)} - x_s\|_1. \tag{36}$$
Similar to Eq. (32), the $\ell_2$ error bound can be calculated as
$$\|x^{(t)} - x_s\|_2 \le \|x^{(t)} - x_s\|_1 \le \prod_{i=1}^{t}\Big(\frac{2}{1 - \mu(A)s}|S| - \frac{1 + \mu(A)s}{1 - \mu(A)s}\min(S_i, P_i) - 1\Big)\mu(A)\,\|x^{(0)} - x_s\|_1 \le \Big(\big(\frac{2}{1 - \mu(A)s}|S| - \frac{1 + \mu(A)s}{1 - \mu(A)s}\min(S_k, P_k) - 1\big)\mu(A)\Big)^t \|x_s\|_1 \le q_1\exp(c_3 t), \tag{37}$$
where $q_1 = \|x_s\|_1$ and $c_3 = \log\big((\frac{2}{1 - \mu(A)s}|S| - \frac{1 + \mu(A)s}{1 - \mu(A)s}\min(S_k, P_k) - 1)\mu(A)\big)$. Comparing $c_3$ with $c_2$, we have
$$\exp(c_2) - \exp(c_3) = 2\mu(A)\Big(s - \frac{|S|}{1 - \mu(A)s} + \frac{\mu(A)s}{1 - \mu(A)s}\min(S_k, P_k)\Big) \ge 2\mu(A)\Big(s - \frac{|S|}{1 - \mu(A)s}\Big) > 0 \tag{38}$$
when $|S| < s(1 - \mu(A)s)$. Under this circumstance, we have
$$q_1 = \|x_s\|_1 \le |S|B < s(1 - \mu(A)s)B \le sB. \tag{39}$$
Note that $x_s$ is sampled from $\gamma(B, s)$, so Eqs. (38) and (39) hold with probability $1 - \eta$, where
$$\eta = \frac{s - |S|}{s} = \mu(A)s. \tag{40}$$
Similar to LISTA with support selection, there exists a $t^{**}$ such that, when $t > t^{**}$, $\|x^{(t)} - x_s\|_1 \le \min_{i \in S}|(x_s)_i|$, and therefore $\mathrm{supp}(x^{(t)}) = S$. Recalling that $c_3 < c_2$ holds with probability $1 - \eta$, we have $t^{**} < t^*$ with probability $1 - \eta$. When we use the same settings for $p^{(t)}$ as LISTA-SS, we have the same $t'$ satisfying $P_{t'} > s$. Let $t_1 = \max(t^{**}, t')$; then $t_1 \le t_0$ with probability $1 - \eta$. When $t > t_1$, we have $x^{(t)}_i \neq 0$ and $i \in S_{p_t}$ for every $i \in S$. The same as Eqs. (33) and (34), there is
$$\|x^{(t+1)} - x_s\|_2 \le C\,\|x^{(t)} - x_s\|_2,$$
where $C = \|(I - U^{(t)}A)(S, S)\|_2 \le \|(I - U^{(t)}A)(S, S)\|_F \le \sqrt{|S|^2\mu(A)^2} = |S|\mu(A) \le s\mu(A)$.



Figure 1: The t-th layer of LISTA and EBT-LISTA.

Figure 2(a) shows how the learned parameters (in logarithmic coordinates) vary with the layer index in LISTA and EBT-LISTA. Note that the mean values are removed to align the ranges of the parameters of different models on the same y-axis. It can be seen that the obtained values of the parameter in EBT-LISTA do not change much from lower layers to higher layers, while the reconstruction errors in fact decrease. By contrast, the obtained threshold values in LISTA vary considerably across layers. These results imply that the optimal thresholds in EBT-LISTA are indeed independent of (i.e., disentangled from) the reconstruction error, which confirms the theoretical result in Theorem 1. Similar observations can also be made on LISTA-SS (i.e., LISTA with support selection) and our EBT-LISTA-SS (i.e., EBT-LISTA with support selection), as shown in Figure 2(b).

Figure 2: Disentanglement of the reconstruction error and learnable parameters in our EBT. $z^{(t)}$ here indicates $\rho^{(t)}$ and $b^{(t)}$ for networks with and without EBT, respectively. We analyze the obtained threshold values in EBT-LISTA and EBT-LISTA-SS, i.e., $b^{(t)} = \rho^{(t)}\|U^{(t)}(Ax^{(t)} - y)\|_p$ (with $p = 2$), and compare them with the threshold values obtained in LISTA and LISTA-SS. Since the threshold values in our EBT-based models differ from sample to sample, we show the results in Figure 3. It can be seen that the learned thresholds in our EBT-based methods and in the original LISTA and LISTA-SS are similar, which indicates that the introduced EBT mechanism does not modify the training dynamics of the original methods; our EBT works by disentangling the reconstruction error and the learnable parameters.
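To make the per-sample nature of the EBT thresholds concrete, the following sketch computes $b^{(t)} = \rho^{(t)}\|U^{(t)}(Ax^{(t)} - y)\|_2$ for a batch of observations. The choice $U = A^\top$ and the fixed $\rho$ are hypothetical stand-ins for the learned parameters; the point is that, unlike the single shared threshold in LISTA, each observation in the batch receives its own threshold value.

```python
import numpy as np

def ebt_thresholds(X, Y, A, U, rho, ord=2):
    """Per-sample EBT thresholds b = rho * ||U(Ax - y)||_ord,
    one value per column (observation) of the batch."""
    return rho * np.linalg.norm(U @ (A @ X - Y), ord=ord, axis=0)
```

A quick way to see the adaptivity: feed a batch of estimates that are far from their targets and a batch that is close, and the former will receive larger thresholds.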

Figure 3: Thresholds obtained from different methods across layers. Validation of Theorem 2. Figure 4(a) shows how the NMSE of EBT-LISTA-SS varies with the layer index. Besides the $\ell_1$ norm (i.e., $b^{(t)} = \rho^{(t)}\|U^{(t)}(Ax^{(t)} - y)\|_1$) concerned in the theorem, we also test EBT-LISTA-SS with the $\ell_2$ norm (i.e., letting $b^{(t)} = \rho^{(t)}\|U^{(t)}(Ax^{(t)} - y)\|_2$). It can be seen that, with both the $\ell_1$ and $\ell_2$ norms, EBT-LISTA-SS leads to consistently faster convergence than LISTA-SS. Also, it is clear that there exist two convergence phases for EBT-LISTA-SS and LISTA-SS, and the later phase is indeed faster. With faster convergence, EBT-LISTA-SS finally achieves superior performance. The experiment is performed in the noiseless case with $p_b = 0.95$. Similar observations can be made on the basis of other variants of LISTA (e.g., ALISTA; see Figure 4(b)).
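The support-selection shrinkage $\mathrm{sh}^p_{(b,p)}$ appearing in these models exempts the largest $p\%$ of entries (in magnitude) from thresholding and soft-thresholds the rest. The sketch below shows one common way to implement it; the function name is our own, and ties among equal magnitudes are broken arbitrarily by the sort.

```python
import numpy as np

def shp(v, b, p):
    """Support-selection shrinkage shp_{(b,p)}: the largest p% entries of |v|
    bypass shrinkage; all other entries are soft-thresholded with b."""
    k = int(np.ceil(p / 100.0 * v.size))  # size of the selected set S_p
    selected = np.argsort(np.abs(v))[-k:] if k > 0 else np.array([], dtype=int)
    out = np.sign(v) * np.maximum(np.abs(v) - b, 0.0)  # soft-threshold all
    out[selected] = v[selected]                        # pass S_p through intact
    return out
```

With $p = 0$ the operator reduces to the plain soft-thresholding used in standard LISTA, which is why support selection can be layered on top of either LISTA or EBT-LISTA without other changes.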

Figure 4: Validation of Theorem 2: there exist two convergence phases, and our EBT accelerates the convergence of LISTA-SS, in particular in the first phase. Adaptivity to unknown sparsity. As mentioned, in some practical scenarios there may exist a gap between the training and test data distributions, or we may not know the distribution of the real test data and have to train on synthesized data based on a guess of the test distribution. Under such circumstances, it is important to consider the adaptivity/generalization of a sparse coding model (trained on a specific data distribution or with a specific sparsity) to test data sampled from different distributions with different sparsity. We conduct an experiment in such a scenario, in which we let the test sparsity differ from the training sparsity. Figure 5 shows the results in three different settings. The black curves represent the optimal model, where LISTA is trained on exactly the same sparsity as that of the test data. It can be seen that our EBT has considerable advantages in such a practical scenario where adaptivity to un-trained sparsity is required, and the performance gain over LISTA is larger when the distribution shift between training and test is larger (cf. the purple and yellow lines in Figures 5(a) and 5(b)).
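For reference, the NMSE reported in these figures is, to our understanding, the normalized mean squared error in dB; a minimal per-sample helper is sketched below (averaging over the test set, which the reported curves presumably include, is omitted).

```python
import numpy as np

def nmse_db(x_hat, x_s):
    """NMSE in dB: 10 * log10(||x_hat - x_s||^2 / ||x_s||^2)."""
    return 10.0 * np.log10(np.sum((x_hat - x_s) ** 2) / np.sum(x_s ** 2))
```

For example, an estimate whose error energy is 1% of the signal energy scores -20 dB, and lower (more negative) values indicate better recovery.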


Figure 5: NMSE of different methods when the test sparsity is different from the training sparsity. We use "Origin" to indicate the optimal scenario where the LISTA models are trained on exactly the same sparsity as that of the test data.

Figure 6: NMSE of different sparse coding methods under different noise levels. It can be seen that our EBT performs favorably under SNR = ∞ and SNR = 40dB.

Figure 7: NMSE of different sparse coding methods in different settings, where different sparsity and different condition numbers are considered. When we vary the condition numbers, we fix $p_b = 0.9$.

(a) $p_b = 0.8$, LISTA-SS: $\zeta = 0.055$; (b) $p_b = 0.8$, EBT-LISTA-SS: $\zeta = 0.041$; (c) $p_b = 0.9$, LISTA-SS: $\zeta = 0.0067$; (d) $p_b = 0.9$, EBT-LISTA-SS: $\zeta = 0.0026$

Figure 8: 3D reconstruction error maps of different methods in different settings. $\zeta$ here is the mean estimation error in degrees. Note that the theoretical maximal error is 0.1 and 0.03 when $p_b = 0.8$ and $p_b = 0.9$, respectively.

Figure 9: NMSE of different sparse coding methods in different settings where different sparsity and different condition numbers are considered.

Figure 10: NMSE of different sparse coding methods when the sparsity of the data follows a certain distribution.

Figure 11: NMSE of different sparse coding methods where different regularization coefficients λ are considered.

The derivation of Eq. (36) in the proof of Theorem 2 proceeds as follows:
$$\|x^{(t+1)} - x_s\|_1 = \sum_{i \in S}|(x^{(t+1)} - x_s)_i| \le \sum_{i \in S}\Big(\big|[(I - U^{(t)}A)(x^{(t)} - x_s)]_i\big| + |b^{(t)}g(x^{(t+1)}_i)|\Big) = \sum_{i \in S}\Big(\Big|\sum_{j \in S}(I - U^{(t)}A)_{ij}(x^{(t)} - x_s)_j\Big| + |b^{(t)}g(x^{(t+1)}_i)|\Big) \le \sum_{i \in S}\sum_{j \in S,\, j \neq i}\big|(I - U^{(t)}A)_{ij}(x^{(t)} - x_s)_j\big| + \big(|S| - \min(S_{t+1}, P_{t+1})\big)|b^{(t)}| \le (|S| - 1)\mu(A)\|x^{(t)} - x_s\|_1 + \big(|S| - \min(S_{t+1}, P_{t+1})\big)\mu(A)\frac{1 + \mu(A)s}{1 - \mu(A)s}\|x^{(t)} - x_s\|_1 \le \Big(\frac{2}{1 - \mu(A)s}|S| - \frac{1 + \mu(A)s}{1 - \mu(A)s}\min(S_{t+1}, P_{t+1}) - 1\Big)\mu(A)\|x^{(t)} - x_s\|_1. \tag{36}$$

