GENERALIZATION AND ESTIMATION ERROR BOUNDS FOR MODEL-BASED NEURAL NETWORKS

Abstract

Model-based neural networks provide unparalleled performance for various tasks, such as sparse coding and compressed sensing problems. Due to the strong connection with the sensing model, these networks are interpretable and inherit the prior structure of the problem. In practice, model-based neural networks exhibit higher generalization capability compared to ReLU neural networks. However, this phenomenon has not been addressed theoretically. Here, we leverage complexity measures, including the global and local Rademacher complexities, to provide upper bounds on the generalization and estimation errors of model-based networks. We show that the generalization abilities of model-based networks for sparse recovery outperform those of regular ReLU networks, and derive practical design rules that allow constructing model-based networks with guaranteed high generalization. We demonstrate through a series of experiments that our theoretical insights shed light on a few behaviours experienced in practice, including the fact that ISTA and ADMM networks exhibit higher generalization abilities than ReLU networks, especially for a small number of training samples.

1. INTRODUCTION

Model-based neural networks provide unprecedented performance gains for solving sparse coding problems, such as the learned iterative shrinkage and thresholding algorithm (ISTA) (Gregor & LeCun, 2010) and learned alternating direction method of multipliers (ADMM) (Boyd et al., 2011). In practice, these approaches outperform feed-forward neural networks with ReLU nonlinearities. Such networks are usually obtained via algorithm unrolling (or unfolding) techniques, first proposed by Gregor and LeCun (Gregor & LeCun, 2010), which connect iterative algorithms to neural network architectures. The trained networks can potentially shed light on the problem being solved. In ISTA networks, each layer represents an iteration of a gradient-descent procedure; as a result, the output of each layer is a valid reconstruction of the target vector, and we expect the reconstructions to improve with the network's depth. These networks capture the original problem structure, which translates in practice to a reduced number of required training samples (Monga et al., 2021). Moreover, the generalization abilities of model-based networks tend to improve over those of regular feed-forward neural networks (Behboodi et al., 2020; Schnoor et al., 2021). Understanding the generalization of deep learning algorithms has become an important open question. The generalization error of machine learning models measures the ability of a class of estimators to generalize from training samples to unseen ones, avoiding overfitting of the training set (Jakubovitz et al., 2019). Surprisingly, various deep neural networks exhibit high generalization abilities, even as the networks' complexity increases (Neyshabur et al., 2015b; Belkin et al., 2019).
Classical machine learning measures such as the Vapnik-Chervonenkis (VC) dimension (Vapnik & Chervonenkis, 1991) and the Rademacher complexity (RC) (Bartlett & Mendelson, 2002) predict an increasing generalization error (GE) with increasing model complexity, and fail to explain the improved generalization observed in experiments. More refined measures, which take the training process into account and result in tighter bounds on the estimation error (EE), were proposed to investigate this gap, such as the local Rademacher complexity (LRC) (Bartlett et al., 2005). To the best of our knowledge, the EE of model-based networks has not yet been investigated using these complexity measures.

1.1. OUR CONTRIBUTIONS

In this work, we leverage existing complexity measures such as the RC and LRC in order to bound the generalization and estimation errors of learned ISTA and learned ADMM networks.

• We provide new bounds on the GE of ISTA and ADMM networks, showing that the GE of model-based networks is lower than that of common ReLU networks.

1.2. RELATED WORK

Related works bound the GE of neural networks through various properties of the networks and the training data (Sokolić et al., 2017), and analyze the effect of regularization techniques employed in deep learning, such as weight decay, early stopping, and dropout, on the generalization abilities (Neyshabur et al., 2015a; Gao & Zhou, 2016; Amjad et al., 2021). Additional works consider global properties of the networks, such as a bound on the product of the Frobenius norms of all weight matrices in the network (Golowich et al., 2018). However, these bounds do not capture the behaviour of the GE as a function of network depth, where an increase in depth typically results in improved generalization. This also applies to the bounds on the GE of ReLU networks detailed in Section 2.3. Recently, a few works focused on bounding the GE specifically for deep iterative recovery algorithms (Behboodi et al., 2020; Schnoor et al., 2021). They consider a broad class of unfolded networks for sparse recovery, and provide bounds which scale logarithmically with the number of layers (Schnoor et al., 2021). However, these bounds still do not capture the behaviours experienced in practice. Much work has also focused on incorporating the networks' training process into the bounds. The LRC framework due to Bartlett, Bousquet, and Mendelson (Bartlett et al., 2005) assumes that the training process results in a smaller class of estimation functions, such that the distance between the estimators in the class and the empirical risk minimizer (ERM) is bounded. A related framework is the effective dimensionality due to Zhang (Zhang, 2002).
These frameworks result in different bounds, which relate the EE to the distance between the estimators. However, these local complexity measures have not been applied to model-based neural networks.

2. PRELIMINARIES

Throughout the paper we use boldface lowercase and uppercase letters to denote vectors and matrices, respectively. The L1 and L2 norms of a vector x are written as ∥x∥_1 and ∥x∥_2, and the L∞ norm (which corresponds to the maximal L1 norm over the matrix's rows) and spectral norm of a matrix X are denoted by ∥X∥_∞ and ∥X∥_σ, respectively. We denote the transpose operation by (·)^T. For any function f and class of functions H, we define f ∘ H = {x ↦ f(h(x)) : h ∈ H}.

2.1. NETWORK ARCHITECTURE

We focus on model-based networks for sparse vector recovery, applicable to the linear inverse problem

y = Ax + e    (1)

where y ∈ R^{n_y} is the observation vector with n_y entries, x ∈ R^{n_x} is the target vector with n_x entries, with n_x > n_y, A ∈ R^{n_y×n_x} is the linear operator, and e ∈ R^{n_y} is additive noise. The target vectors are sparse with sparsity rate ρ, such that at most ⌊ρn_x⌋ entries are nonzero. The inverse problem consists of recovering the target vector x from the observation vector y.

Given that the target vector is assumed to be sparse, recovering x from y in (1) can be formulated as an optimization problem, such as the least absolute shrinkage and selection operator (LASSO) (Tibshirani & Ryan, 2013), which can be solved with well-known iterative methods including ISTA and ADMM. To address more complex problems, such as an unknown linear mapping A, and to avoid parameter fine-tuning, these algorithms can be mapped into model-based neural networks using unfolding or unrolling techniques (Gregor & LeCun, 2010; Boyd et al., 2011; Monga et al., 2021; Yang et al., 2018). The network's architecture imitates the functionality of the original iterative method and enables learning the model's parameters from a set of training examples.

We consider neural networks with L layers (referred to as the network's depth), which corresponds to the number of iterations in the original iterative algorithm. The layer outputs of an unfolded ISTA network h_I^l, l ∈ [1, L], are defined by the following recurrence relation, shown in Fig. 1:

h_I^l = S_λ(W^l h_I^{l-1} + b),   h_I^0 = S_λ(b)    (2)

where W^l ∈ R^{n_x×n_x}, l ∈ [1, L], are the weight matrices corresponding to each of the layers, with bounded norms ∥W^l∥_∞ ≤ B_l. We further assume that the L2 norm of W^1 is bounded by B_1. The vector b = A^T y is a constant bias term that depends on the observation y, where we assume that the initial values are bounded, such that ∥h_I^0∥_1 ≤ B_0.
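The ISTA recurrence above can be sketched in a few lines of NumPy. This is a minimal illustration under our own naming and shape conventions, not the authors' implementation:

```python
import numpy as np

def soft_threshold(h, lam):
    """Elementwise soft-thresholding S_lam(h) = sign(h) * max(|h| - lam, 0)."""
    return np.sign(h) * np.maximum(np.abs(h) - lam, 0.0)

def ista_forward(y, A, weights, lam):
    """Forward pass of an unfolded ISTA network:
    h^0 = S_lam(b), h^l = S_lam(W^l h^{l-1} + b), with b = A^T y.
    `weights` is a list of L matrices of shape (n_x, n_x)."""
    b = A.T @ y                      # constant bias term derived from the observation
    h = soft_threshold(b, lam)       # h^0
    for W in weights:                # one iteration of the recurrence per layer
        h = soft_threshold(W @ h + b, lam)
    return h                         # the network's prediction x_hat = h^L
```

With all-zero weights every layer collapses to h^0 = S_λ(b), which provides a simple sanity check of the recurrence.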
In addition, S_λ(·) is the elementwise soft-thresholding operator

S_λ(h) = sign(h) max(|h| − λ, 0)    (3)

where the functions sign(·) and max(·) are applied elementwise, and 0 is a vector of zeros. Since S_λ(·) is an elementwise function, it preserves the input's dimension and can be applied to scalar or vector inputs. The network's prediction is given by the last layer in the network, x̂ = h_I^L(y). We note that the estimators are functions mapping y to x̂, h_I^L : R^{n_y} → R^{n_x}, characterized by the weights, i.e. h_I^L = h_I^L({W^l}_{l=1}^L). The class of functions representing the output at depth L in an ISTA network is H_I^L = {h_I^L({W^l}_{l=1}^L) : ∥W^l∥_∞ ≤ B_l, l ∈ [1, L], ∥W^1∥_2 ≤ B_1}.

Similarly, the lth layer of an unfolded ADMM network is defined by the following recurrence relation

h_A^l = W^l (z^{l-1} + u^{l-1}) + b
z^l = S_λ(h_A^l − u^{l-1}),   z^0 = 0
u^l = u^{l-1} − γ(h_A^l − z^l),   u^0 = 0    (4)

where 0 is a vector of zeros, b = A^T y is a constant bias term, and γ > 0 is the step size of the original ADMM algorithm, as shown in Fig. 1. The estimators satisfy h_A^L : R^{n_y} → R^{n_x}, and the class of functions representing the output at depth L in an ADMM network is H_A^L = {h_A^L({W^l}_{l=1}^L) : ∥W^l∥_∞ ≤ B_l, l ∈ [1, L], ∥W^1∥_2 ≤ B_1}, where we impose the same assumptions on the weight matrices as for ISTA networks.

For a depth-L ISTA or ADMM network, the learnable parameters are the weight matrices {W^l}_{l=1}^L. The weights are learned by minimizing a loss function L on a set of m training examples S = {(x_i, y_i)}_{i=1}^m, drawn from an unknown distribution D consistent with the model in (1). We consider the case where the per-example loss function is obtained by averaging over the per-coordinate losses:

L(h(y), x) = (1/n_x) Σ_{j=1}^{n_x} ℓ(h_j(y), x_j)    (5)

where h_j(y) and x_j denote the jth coordinates of the estimated and true targets, and ℓ : R × R → R_+ is 1-Lipschitz in its first argument.
This requirement is satisfied in many practical settings, for example by the pth power of L_p norms, and is also required in related work (Xu et al., 2016). The loss of an estimator h ∈ H, which measures the difference between the true value x and the estimate h(y), is denoted for convenience by L(h) = L(h(y), x). There exist additional forms of learned ISTA and ADMM networks, which include learning an additional set of weight matrices affecting the bias terms (Monga et al., 2021). Also, the optimal value of λ generally depends on the sparsity level of the target vector. Note, however, that for learned networks the value of λ at each layer can also be learned; here we focus on a more basic architecture with fixed λ in order to draw theoretical conclusions.
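The unfolded ADMM recurrence in (4) admits an analogous sketch. Again this is a minimal illustration with our own naming and shape conventions, following the update order of (4) as written above:

```python
import numpy as np

def soft_threshold(h, lam):
    return np.sign(h) * np.maximum(np.abs(h) - lam, 0.0)

def admm_forward(y, A, weights, lam, gamma):
    """Forward pass of an unfolded ADMM network following (4):
    h^l = W^l (z^{l-1} + u^{l-1}) + b, z^l = S_lam(h^l - u^{l-1}),
    u^l = u^{l-1} - gamma * (h^l - z^l), with z^0 = u^0 = 0 and b = A^T y."""
    b = A.T @ y
    z = np.zeros_like(b)
    u = np.zeros_like(b)
    h = np.zeros_like(b)
    for W in weights:
        h = W @ (z + u) + b
        z = soft_threshold(h - u, lam)
        u = u - gamma * (h - z)
    return h    # the prediction x_hat = h^L
```

As with the ISTA sketch, zero weights and λ = 0 reduce the network to the bias b = A^T y, which makes a convenient sanity check.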

2.2. GENERALIZATION AND ESTIMATION ERRORS

In this work, we focus on upper bounding the GE and EE of the model-based neural networks of Fig. 1. The GE of a class of estimation functions h ∈ H, with h : R^{n_y} → R^{n_x}, is defined as

G(H) = E_S [ sup_{h∈H} ( L_D(h) − L_S(h) ) ]    (6)

where L_D(h) = E_D[L(h)] is the expected loss with respect to the data distribution D (the joint probability distribution of the targets and observations), L_S(h) = (1/m) Σ_{i=1}^m L(h(y_i), x_i) is the average empirical loss with respect to the training set S, and E_S is the expectation over the training datasets. The GE is a global property of the class of estimators, which captures how well the class H is suited to the learning problem. A large GE implies that there are hypotheses in H for which L_D deviates substantially from L_S, on average over S. However, the GE in (6) does not capture how the learning algorithm chooses the estimator h ∈ H.

In order to capture the effect of the learning algorithm, we consider local properties of the class of estimators, and focus on bounding the estimation error (EE)

E(H) = L_D(ĥ) − inf_{h∈H} L_D(h)    (7)

where ĥ is the ERM satisfying L_S(ĥ) = inf_{h∈H} L_S(h). We note that the ERM approximates the learned estimator h_learned, which is obtained by training the network on a set of training examples S using algorithms such as SGD. However, the estimator h_learned depends on the optimization algorithm and can differ from the ERM. The difference between the empirical loss associated with ĥ and that associated with h_learned is usually referred to as the optimization error.

Common deep neural network architectures have large GE compared to the low EE achieved in practice. This holds even when the networks are not trained with explicit regularization, such as weight decay, early stopping, or dropout (Srivastava et al., 2014; Neyshabur et al., 2015b).
This empirical phenomenon is observed across various architectures and hyper-parameter choices (Liang et al., 2019; Novak et al., 2018; Lee et al., 2018; Neyshabur et al., 2018).
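To make the GE definition in (6) concrete, the GE of a tiny finite hypothesis class can be approximated by Monte Carlo. Everything in this sketch (the scalar model, the class of slopes, the sample sizes) is an illustrative assumption, not the paper's setting:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy scalar model: x = theta_star * y + noise; hypotheses h_theta(y) = theta * y,
# with the absolute loss. A large held-out set approximates L_D.
thetas = np.linspace(-2.0, 2.0, 41)       # finite hypothesis class
theta_star, m, n_trials, n_test = 1.0, 20, 200, 10_000

def draw(n):
    y = rng.normal(size=n)
    x = theta_star * y + 0.1 * rng.normal(size=n)
    return y, x

y_te, x_te = draw(n_test)
L_D = np.array([np.mean(np.abs(t * y_te - x_te)) for t in thetas])

gaps = []
for _ in range(n_trials):
    y_tr, x_tr = draw(m)                  # a fresh training set S
    L_S = np.array([np.mean(np.abs(t * y_tr - x_tr)) for t in thetas])
    gaps.append(np.max(L_D - L_S))        # sup_h (L_D(h) - L_S(h))
ge_estimate = np.mean(gaps)               # Monte-Carlo estimate of E_S[sup ...]
```

Increasing m shrinks the estimate, mirroring the 1/√m behaviour of the RC-based bounds of Section 2.3.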

2.3. RADEMACHER COMPLEXITY BASED BOUNDS

The RC is a standard tool which captures the ability of a class of functions to approximate noise, where a lower complexity indicates a reduced generalization error. Formally, the empirical RC of a class of scalar estimators H, with h : R^{n_y} → R for h ∈ H, over m samples is

R_m(H) = E_{{ϵ_i}_{i=1}^m} [ sup_{h∈H} (1/m) Σ_{i=1}^m ϵ_i h(y_i) ]    (8)

where {ϵ_i}_{i=1}^m are independent Rademacher random variables for which Pr(ϵ_i = 1) = Pr(ϵ_i = −1) = 1/2. The GE is bounded by the RC as

G(H) ≤ 2 E_S [ R_m(L ∘ H) ]    (9)

where L is defined in Section 2.1 (Shalev-Shwartz & Ben-David, 2014). We observe that the class of functions L ∘ H = {(x, y) ↦ L(h(y), x) : h ∈ H} consists of scalar functions, f : R^{n_x} × R^{n_y} → R for f ∈ L ∘ H. Therefore, in order to bound the GE of the classes of functions defined by ISTA and ADMM networks, we can first bound their RC.

Throughout the paper, we compare the model-based architectures with a feed-forward network with ReLU activations, given by ReLU(h) = max(h, 0). In this section, we review existing bounds for the generalization error of these networks. The layers of a ReLU network h_R^l, l ∈ [1, L], are defined by the recurrence relation h_R^l = ReLU(W^l h_R^{l-1} + b), h_R^0 = ReLU(b), where W^l, l ∈ [1, L], are weight matrices which satisfy the same conditions as the weight matrices of ISTA and ADMM networks. The class of functions representing the output at depth L in a ReLU network is H_R^L = {h_R^L({W^l}_{l=1}^L) : ∥W^l∥_∞ ≤ B_l, l ∈ [1, L], ∥W^1∥_2 ≤ B_1}. For this class, Theorem 2 shows that there exists a constant c such that R_m(H_R'^L) ≥ c B_0 ∏_{l=1}^L B'_l, where H_R'^L = {h_R^L({W^l}_{l=1}^L) : ∥W^l∥_σ ≤ B'_l, l ∈ [1, L]}. This result shows that under the RC framework, the GE of ReLU networks behaves as the product of the weight matrices' norms ∏_{l=1}^L B_l, as captured in Theorem 1. Theorem 2 implies that the dependence on the weight matrices' norms cannot be substantially improved with the RC framework for ReLU networks.
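The empirical RC in (8) can itself be estimated by sampling Rademacher sign vectors, for a finite class of scalar hypotheses represented by their outputs on the m samples. This is an illustrative sketch; the names are ours:

```python
import numpy as np

rng = np.random.default_rng(0)

def empirical_rc(outputs, n_draws=2000):
    """Monte-Carlo estimate of R_m(H) for a finite class.
    `outputs` has shape (num_hypotheses, m): row k holds h_k(y_1), ..., h_k(y_m)."""
    m = outputs.shape[1]
    eps = rng.choice([-1.0, 1.0], size=(n_draws, m))   # Rademacher signs
    # (1/m) sum_i eps_i h(y_i) for every draw and hypothesis, then sup over hypotheses
    return (eps @ outputs.T / m).max(axis=1).mean()
```

A class that can correlate with random signs (e.g. one containing both h and −h) has positive RC, while the trivial class {0} has RC exactly zero.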

3. GENERALIZATION ERROR BOUNDS: GLOBAL PROPERTIES

In this section, we derive theoretical bounds on the GE of learned ISTA and ADMM networks. From these bounds we deduce design rules to construct ISTA and ADMM networks with a GE which does not increase exponentially with the number of layers. We start by presenting theoretical guarantees on the RC of any class of functions after applying the soft-thresholding operation. Soft-thresholding is a basic block that appears in multiple iterative algorithms, and is therefore used as the nonlinear activation in many model-based networks. It results from the proximal gradient of the L1 norm (Palomar & Eldar, 2010). The following lemma expresses how the RC of a class of functions is affected by applying soft-thresholding to each function in the class. The proof is provided in the supplementary material.

Lemma 1 (Rademacher complexity of soft-thresholding). Given any class of scalar functions H where h : R^n → R, h ∈ H, for any integer n ≥ 1, and m i.i.d. training samples,

R_m(S_λ ∘ H) ≤ R_m(H) − λT/m    (10)

where T = Σ_{j=1}^m T_j and T_j = E_{{ϵ_i}_{i=2}^m} [ 1_{h*(y_j) > λ ∧ h'*(y_j) < −λ} ], such that

(h*, h'*) = argmax_{h,h'∈H} ( h(y_1) − h'(y_1) − 2λ 1_{h(y_1) > λ ∧ h'(y_1) < −λ} + Σ_{i=2}^m ϵ_i (S_λ(h(y_i)) + S_λ(h'(y_i))) ).

The quantity T is a non-negative value obtained during the proof, which depends on the network's number of layers, the underlying data distribution D, and the soft-threshold value λ. As seen from (10), the value of T dictates the reduction in RC due to soft-thresholding. The value of T increases as λ decreases. In the case that λT increases with λ, higher values of λ further reduce the RC of the class of functions H due to the soft-thresholding.

We now focus on the class of functions representing the output of a neuron at depth L in an ISTA network, H_I^L. In the following theorem, we bound its GE using the RC framework and Lemma 1. The proof is provided in the supplementary material.
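Lemma 1's prediction that soft-thresholding can only shrink the RC can be sanity-checked numerically on a toy finite class. The class, λ, and sample sizes below are arbitrary choices, and paired sign draws are reused for both classes to reduce Monte-Carlo noise:

```python
import numpy as np

rng = np.random.default_rng(1)

def soft_threshold(h, lam):
    return np.sign(h) * np.maximum(np.abs(h) - lam, 0.0)

def empirical_rc(outputs, eps):
    # outputs: (num_hypotheses, m); eps: (n_draws, m) Rademacher signs
    m = outputs.shape[1]
    return (eps @ outputs.T / m).max(axis=1).mean()

m, lam = 30, 0.5
H = rng.normal(size=(8, m))                        # outputs of a toy finite class
eps = rng.choice([-1.0, 1.0], size=(5000, m))      # shared sign draws

rc_H = empirical_rc(H, eps)
rc_SH = empirical_rc(soft_threshold(H, lam), eps)  # R_m(S_lam o H)
# Lemma 1 predicts rc_SH <= rc_H, with a gap controlled by lam * T / m.
```

The gap widens as λ grows, consistent with the discussion of λT above.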
Theorem 3 (Generalization error bound for ISTA networks). Consider the class of learned ISTA networks of depth L as described in (2), and m i.i.d. training samples. Then there exist T^(l) for l ∈ [1, L] in the range T^(l) ∈ [0, min(m B_l G_I^{l-1}/λ, m)] such that G(H_I^L) ≤ 2 E_S[G_I^L], where

G_I^l = B_0 ∏_{l'=1}^{l} B_{l'} / √m − (λ/m) Σ_{l'=1}^{l-1} T^(l') ∏_{j=l'+1}^{l} B_j − λT^(l)/m.    (11)

Next, we show that for a specific distribution, the expected value of T^(l) is greater than 0. Under an additional bound on the expected value of the estimators (specified in the supplementary material),

E_S[T^(l)] ≥ max( m (1 − 2e^{−(c B_l b^(l) − λ)}), 0 )    (12)

where b^(l+1) = B_l b^(l) − λ, b^(1) = B_0, and c ∈ (0, 1]. Increasing λ or decreasing B_l decreases the bound in (12), since crossing the threshold is less probable. Depending on λ and {B_l}_{l=1}^L (specifically, when λ ≤ c B_l b^(l) + ln 2), the bound in (12) is positive and enforces a non-zero reduction in the GE. Along with Theorem 3, this shows the expected reduction in GE of ISTA networks compared to ReLU networks. The reduction is controlled by the value of the soft threshold.

To obtain a more compact relation, we can choose the maximal matrix norm B = max_{l∈[1,L]} B_l and denote T = min_{l∈[1,L]} T^(l) ∈ [0, m], which leads to

G(H_I^L) ≤ 2 ( B_0 B^L / √m − (λ E_S(T)/m) (B^L − 1)/(B − 1) ).    (13)

Comparing this bound with the GE bound for ReLU networks presented in Theorem 1 shows the expected reduction due to the soft-thresholding activation. This result also implies practical rules for designing ISTA networks with low generalization error. We note that parameters such as the soft-threshold value λ and the number of samples m are predefined by the problem being solved (for example, in ISTA, the value of λ is chosen according to the singular values of A). We derive an implicit design rule from (13) for a nonincreasing GE, as detailed in Section A.2, by restricting the matrix norms to satisfy B ≤ 1 + λE_S(T)/(√m B_0).
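The compact bound and the design rule B ≤ 1 + λE_S(T)/(√m B_0) can be checked numerically. The constants below are illustrative stand-ins (in particular `lam_ET` plays the role of λE_S(T)), not values from the paper:

```python
import numpy as np

B0, m, lam_ET = 1.0, 100, 2.0                 # illustrative constants
B = 1 + lam_ET / (np.sqrt(m) * B0)            # largest norm allowed by the design rule

def ista_ge_bound(L):
    # compact bound: 2 * (B0 * B^L / sqrt(m) - (lam_ET / m) * (B^L - 1) / (B - 1))
    return 2 * (B0 * B**L / np.sqrt(m) - (lam_ET / m) * (B**L - 1) / (B - 1))

def relu_ge_bound(L):
    # ReLU-style bound without the soft-thresholding correction: 2 * B0 * B^L / sqrt(m)
    return 2 * B0 * B**L / np.sqrt(m)

depths = np.arange(1, 11)
ista = np.array([ista_ge_bound(L) for L in depths])
relu = np.array([relu_ge_bound(L) for L in depths])
# At the design-rule boundary the ISTA bound is constant in depth,
# while the bound without the correction grows geometrically as B^L.
```

Choosing B strictly below the design-rule boundary makes the ISTA bound strictly decreasing in depth, which is the regime the design rule targets.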
Moreover, these results can be extended to convolutional neural networks: since convolution operations can be expressed as multiplication with a convolution matrix, the presented results hold in that case as well. Similarly, we bound the GE of the class of functions representing the output at depth L in an ADMM network, H_A^L. The proof and discussion are provided in the supplementary material.

Theorem 4 (Generalization error bound for ADMM networks). Consider the class of learned ADMM networks of depth L as described in (4), and m i.i.d. training samples. Then there exist T^(l) for l ∈ [1, L−1] in the interval T^(l) ∈ [0, min(m B̃_l G_A^{l-1}/λ̃, m)], where

G_A^l = B_0 ∏_{l'=1}^{l-1} B̃_{l'} / √m − (λ̃/m) Σ_{l'=1}^{l-2} T^(l') ∏_{j=l'+1}^{l-1} B̃_j − λ̃T^(l-1)/m,

with λ̃ = (1 + γ)λ and B̃_l = (1 + 2γ)(B_l + 2), l ∈ [1, L], such that G(H_A^L) ≤ 2 B̃_L E_S[G_A^{L-1}].

We compare the GE bounds for ISTA and ADMM networks to the bound on the GE of ReLU networks presented in Theorem 1. We observe that both model-based networks achieve a reduction in the GE, which depends on the soft threshold, the underlying data distribution, and the bound on the norm of the weight matrices. The bounds also indicate that the soft-thresholding nonlinearity is most valuable for a small number of training samples. The soft-thresholding nonlinearity is the key that enables reducing the GE of ISTA and ADMM networks compared to ReLU networks. Next, we focus on bounding the EE of feed-forward networks based on the LRC framework.

4. ESTIMATION ERROR BOUNDS: LOCAL PROPERTIES

To investigate the EE of model-based networks, we use the LRC framework (Bartlett et al., 2005). Instead of considering the entire class of functions H, the LRC considers only estimators which are close to the optimal estimator,

H_r = { h ∈ H : E_D ∥h(y) − h*(y)∥_2^2 ≤ r },

where h* is such that L_D(h*) = inf_{h∈H} L_D(h). It is interesting to note that the class H_r only restricts the distance between the estimators themselves, and not between their corresponding losses. Following the LRC framework, we consider target vectors ranging in [−1, 1]^{n_x}; therefore, we adapt the networks' estimations by clipping them to lie in the interval [−1, 1]^{n_x}. In our case, we consider the restricted classes of functions representing the output of a neuron at depth L in ISTA, ADMM, and ReLU networks. Moreover, we denote by W^l and W^{l,*}, l ∈ [1, L], the weight matrices corresponding to h ∈ H_r and h*, respectively. Based on these restricted classes of functions, we present the following assumption and theorem (the proof is provided in the supplementary material).

Assumption 1. There exists a constant C ≥ 1 such that for every probability distribution D and estimator h ∈ H,

E_D [ Σ_{j=1}^{n_x} (h_j − h*_j)^2 ] ≤ C E_D [ Σ_{j=1}^{n_x} ( ℓ(h_j) − ℓ(h*_j) ) ],

where h_j and h*_j denote the jth coordinates of the estimators. As pointed out in (Bartlett & Mendelson, 2002), this condition usually follows from a uniform convexity condition on the loss function ℓ. For instance, if |h(y) − x| ≤ 1 for any h ∈ H, y ∈ R^{n_y} and x ∈ R^{n_x}, then the condition is satisfied with C = 1 (Yousefi et al., 2018). Under this assumption, Theorem 5 shows that for any s > 0, with probability at least 1 − e^{−s},

E(H_I^L) ≤ 41 r* + (17C^2 + 48C) s / (m n_x),

where r* = C^2 α^2 ( B_0 B^{L-1} 2^L / √m − (λT/m) η )^2.
The bound is also satisfied for the class of functions represented by depth-L ADMM networks, H_A^L, with r* = C^2 α^2 ( B_0 B̃^{L-2} 2^{L-1} / √m − (λ̃T/m) η̃ )^2, where λ̃ = (1 + γ)λ, B̃ = (1 + 2γ)(B + 2), and η̃ = ( (L−1) B̃^{L-2} (B̃ − 1) − B̃^{L-1} + 1 ) / (B̃ − 1)^2, and for the class of functions represented by depth-L ReLU networks, H_R^L, with r* = C^2 α^2 ( B_0 B^{L-1} 2^L / √m )^2.

From Theorem 5, we observe that the EE decreases by a factor of O(1/m), instead of the factor of O(1/√m) obtained for the GE. This result complies with previous local bounds, which yield faster convergence rates compared to global bounds (Blanchard et al., 2007; Bartlett et al., 2005). Also, the value of α relates the maximal distance r between estimators in H_r to the distance between their corresponding weight matrices, ∥W^l − W^{l,*}∥_∞ ≤ α√r, l ∈ [1, L]. Tighter bounds on the distance between the weight matrices allow us to choose a smaller value of α, resulting in smaller values of r* which improve the EE bounds. The value of α may depend on the network's nonlinearity, the underlying data distribution D, and the number of layers.

Note that the bounds for the model-based architectures depend on the soft-thresholding through the value of λE_S(T). As λE_S(T) increases, the bound on the EE decreases, which emphasizes the nonlinearity's role in the network's generalization abilities. Due to the soft-thresholding, ISTA and ADMM networks attain lower EE bounds than ReLU networks. It is interesting to note that as the number of training samples m increases, the difference between the bounds for the model-based and ReLU networks becomes less significant. In the EE bounds of model-based networks, the parameter B_0 relates the bound to the sparsity level ρ of the target vectors: lower values of ρ result in lower EE bounds, as demonstrated in the supplementary material.
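To see how the λT/m term separates the bounds, one can plug illustrative numbers into the r* expressions of Theorem 5. All constants below, including the stand-in `T_bar` for the quantity T, are assumptions for illustration, not values from the paper:

```python
import numpy as np

# Illustrative constants (not from the paper's experiments).
C, alpha, B0, B, L, m, lam, T_bar = 1.0, 0.1, 1.0, 1.1, 10, 1000, 0.5, 50.0

eta = (L * B**(L - 1) * (B - 1) - B**L + 1) / (B - 1) ** 2

r_relu = C**2 * alpha**2 * (B0 * B**(L - 1) * 2**L / np.sqrt(m)) ** 2
r_ista = C**2 * alpha**2 * (B0 * B**(L - 1) * 2**L / np.sqrt(m)
                            - (lam * T_bar / m) * eta) ** 2
# The soft-thresholding correction (lam * T_bar / m) * eta shrinks r* for the
# ISTA class, and through it the EE bound of Theorem 5.
```

Since r* enters the EE bound multiplicatively, the smaller r* of the model-based classes translates directly into a smaller EE bound, and the gap narrows as m grows, matching the discussion above.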

5. NUMERICAL EXPERIMENTS

In this section, we present a series of experiments that examine how a particular model-based network (an ISTA network) compares to a ReLU network, and showcase the merits of model-based networks. We focus on networks with 10 layers (similar to previous works (Gregor & LeCun, 2010)), to represent realistic model-based network architectures. The networks are trained on a simulated dataset to solve the problem in (1), with target vectors whose entries are drawn elementwise independently from a uniform distribution on [−1, 1]. The linear mapping A is constructed from randomly chosen rows of the real part of the discrete Fourier transform (DFT) matrix (Ong et al., 2019). The sparsity rate is ρ = 0.15, and the noise's standard deviation is 0.1. To train the networks we used the SGD optimizer with the L1 loss over all neurons of the last layer. All results are reproducible through (Authors, 2022), which provides the complete code for the experiments presented in this section.

We concentrate on comparing the networks' EE, since in practice the networks are trained with a finite number of examples. In order to empirically approximate the EE of a class of networks H, we use empirical approximations of h* (which satisfies L_D(h*) = inf_{h∈H} L_D(h)) and of the ERM ĥ, denoted by h*_emp and ĥ_emp, respectively. The estimator h*_emp results from a network trained with 10^4 samples, and the ERM is approximated by a network trained using SGD with m training samples (where m ≤ 10^4). The empirical EE is given by their difference, L_D(ĥ_emp) − L_D(h*_emp).

In Fig. 2, we compare the ISTA and ReLU networks in terms of EE and L1 loss, for networks trained with different numbers of samples (between 10 and 10^4). We observe that for a small number of training samples, the ISTA network substantially reduces the EE compared to the ReLU network.
This can be understood from Theorem 5, which gives lower EE bounds for ISTA networks than for ReLU networks, due to the term λE_S(T). However, for a large number of samples, the EE of both networks decreases to zero, as also expected from Theorem 5. This highlights that the contribution of the soft-thresholding nonlinearity to the generalization abilities of the network is more significant for a small number of training samples.

Throughout the paper, we considered networks with constant bias terms. In this section, we also consider learned bias terms, as detailed in the supplementary material. In Fig. 2, we present the experimental results for networks with constant and with learned biases. The experiments indicate that the choice of constant or learned bias affects the EE and the accuracy less than the choice of nonlinearity does, emphasizing the relevance of the theoretical guarantees. The cases of learned and constant biases have different optimal estimators, as the networks with learned biases have more learnable parameters; it is therefore plausible that a network with more learnable parameters exhibits a lower estimation error, since the corresponding optimal estimator achieves a lower loss.

To analyze the effect of the soft-threshold value on the generalization abilities of the ISTA network, we show in Fig. 3 the empirical EE for multiple values of λ. The experimental results demonstrate that for a small number of samples, increasing λ reduces the EE. As expected from the EE bounds, for a large number of training samples this dependency on the nonlinearity vanishes, and the EE is similar for all values of λ. In Fig. 3b, we show the L1 loss of the ISTA networks for different values of λ. We observe that a low estimation error does not necessarily lead to a low loss value: for m ≲ 100, increasing λ reduces the EE.
These results suggest that, given ISTA networks with different values of λ which all achieve similar accuracy, the networks with higher values of λ provide lower EE. These results also hold for additional network depths: in Section B, we compare the EE for networks with different numbers of layers, and show that they exhibit a similar behaviour.
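The simulated setup of this section can be sketched as follows. The normalization of the DFT rows and the Gaussian form of the noise are our assumptions (the text fixes only the noise standard deviation of 0.1), and the sparsity pattern follows the at-most-⌊ρn_x⌋-nonzeros model of Section 2.1:

```python
import numpy as np

rng = np.random.default_rng(0)

def make_dataset(m, n_x=100, n_y=70, rho=0.15, noise_std=0.1):
    """Simulated data for (1): sparse x in [-1, 1]^{n_x}, y = A x + e,
    with A built from randomly chosen rows of the real part of the DFT matrix."""
    dft = np.fft.fft(np.eye(n_x)).real / np.sqrt(n_x)   # normalization is an assumption
    A = dft[rng.choice(n_x, size=n_y, replace=False)]

    X = rng.uniform(-1.0, 1.0, size=(m, n_x))           # elementwise uniform targets
    k = int(np.floor(rho * n_x))                        # at most floor(rho * n_x) nonzeros
    for i in range(m):
        X[i, rng.choice(n_x, size=n_x - k, replace=False)] = 0.0
    E = noise_std * rng.standard_normal((m, n_y))       # noise model is an assumption
    Y = X @ A.T + E
    return A, X, Y
```

A training set S = {(x_i, y_i)} of any size m can then be drawn from this generator to approximate ĥ_emp, with a large draw (e.g. 10^4 samples) standing in for h*_emp.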

6. CONCLUSION

We derived new GE and EE bounds for ISTA and ADMM networks, based on the RC and LRC frameworks. Under suitable conditions, the GE of the model-based networks is nonincreasing with depth, resulting in a substantial improvement over the GE bound of ReLU networks. The EE bounds explain behaviours experienced in practice, such as ISTA networks demonstrating better estimation abilities than ReLU networks, especially for a small number of training samples. Through a series of experiments, we show that the generalization abilities of ISTA networks are controlled by the soft-threshold value, and that they achieve lower EE along with more accurate recovery than ReLU networks, whose GE and EE bounds increase with the network's depth. It would be interesting to consider how these theoretical insights can be harnessed to design neural networks with high generalization abilities. One approach is to introduce an additional regularizer during training, rooted in the LRC, that penalizes networks with high EE (Yang et al., 2019).



Figure 1: A single layer in the unfolded networks. a. Unfolded ISTA. b. Unfolded ADMM. The learnable parameters are the weight matrices (marked in red).

Theorem 5 (Estimation error bound of ISTA, ADMM, and ReLU networks). Consider the class of functions represented by depth-L ISTA networks, H_I^L, as detailed in Section 2.1, m i.i.d. training samples, and a per-coordinate loss satisfying Assumption 1 with a constant C. Let ∥W^l − W^{l,*}∥_∞ ≤ α√r for some α > 0, and let B ≥ max{α√r, 1}. Then there exists T in the interval T ∈ [0, min(√m B_0 B^{L-1} 2^L / (λη), m)], where η = ( L B^{L-1} (B − 1) − B^L + 1 ) / (B − 1)^2, such that for any s > 0, with probability at least 1 − e^{−s}, E(H_I^L) ≤ 41 r* + (17C^2 + 48C) s / (m n_x), with r* = C^2 α^2 ( B_0 B^{L-1} 2^L / √m − (λT/m) η )^2.

Figure 2: Comparing the EE of ISTA and ReLU networks with 10 layers. a. The EE of the networks as a function of the number of training samples. b. The L1 loss of the networks as a function of the number of training samples. The ISTA network achieves lower EE compared to the ReLU network, along with lower losses.

Figure 3: Estimation error and loss of ISTA networks with 10 layers, as a function of the soft-threshold value λ. The experiments indicate that the estimation error decreases as λ increases, which demonstrates a way to control the networks' generalization abilities through λ.

The derivation of the theoretical guarantees combines existing proof techniques for computing the generalization error of multilayer networks with a new methodology for bounding the RC of the soft-thresholding operator, which allows a better understanding of the generalization ability of model-based networks.

the samples {y i } m i=1 are obtained by the model in (1) from m i.i.d. target vectors {x i } m i=1 drawn from an unknown distribution. Taking the expectation of the RC of L • H with respect to the set of examples S presented in Section 2.1, leads to a bound on the GE G

This architecture leads to the following bound on the GE.

Theorem 1 (Generalization error bound for ReLU networks (Gao & Zhou, 2016)). Consider the class of feed-forward networks of depth L with ReLU activations, H_R^L, as described in Section 2.3, and m i.i.d. training samples. Given a 1-Lipschitz loss function, its GE satisfies G(H_R^L) ≤ 2 B_0 ∏_{l=1}^L B_l / √m.

The proof follows from applying the bound on the RC of ReLU neural networks from (Gao & Zhou, 2016) and combining it with (9). The bound in Theorem 1 holds for any feed-forward network with 1-Lipschitz nonlinear activations (including ReLU), and can be generalized to networks with activations with different Lipschitz constants. We show in Theorem 2 that the bound presented in Theorem 1 cannot be substantially improved for ReLU networks within the RC framework.

Theorem 2 (Lower Rademacher complexity bound for ReLU networks (Bartlett et al., 2017)). Consider the class of feed-forward networks of depth L with ReLU activations, where the weight matrices have bounded spectral norms ∥W^l∥_σ ≤ B'_l, l ∈ [1, L]. The dimension of the output layer is 1, and the dimension of each non-output layer is at least 2. Given m i.i.d. training samples, there exists a constant c such that R_m(H_R'^L) ≥ c B_0 ∏_{l=1}^L B'_l, where H_R'^L = {h_R^L({W^l}_{l=1}^L) : ∥W^l∥_σ ≤ B'_l, l ∈ [1, L]}.

FUNDING

* This project has received funding from the European Research Council (ERC) under the European Union's Horizon 2020 research and innovation programme (grant agreement No. 101000967), and from the Israel Science Foundation under Grant 536/22. Y. C. Eldar and M. R. D. Rodrigues are supported by The Weizmann-UK Making Connections Programme (Ref. 129589). M. R. D. Rodrigues is also supported by the Alan Turing Institute. The authors wish to thank Dr. Gholamali Aminian from the Alan Turing Institute, UK, for his contribution to the proofs' correctness.

