GENERALIZATION AND ESTIMATION ERROR BOUNDS FOR MODEL-BASED NEURAL NETWORKS

Abstract

Model-based neural networks provide unparalleled performance for various tasks, such as sparse coding and compressed sensing problems. Due to the strong connection with the sensing model, these networks are interpretable and inherit the prior structure of the problem. In practice, model-based neural networks exhibit higher generalization capability compared to ReLU neural networks. However, this phenomenon has not been addressed theoretically. Here, we leverage complexity measures, including the global and local Rademacher complexities, in order to provide upper bounds on the generalization and estimation errors of model-based networks. We show that the generalization abilities of model-based networks for sparse recovery outperform those of regular ReLU networks, and derive practical design rules that allow one to construct model-based networks with guaranteed high generalization. We demonstrate through a series of experiments that our theoretical insights shed light on a few behaviours experienced in practice, including the fact that ISTA and ADMM networks exhibit higher generalization abilities, especially for a small number of training samples, compared to ReLU networks.

1. INTRODUCTION

Model-based neural networks provide unprecedented performance gains for solving sparse coding problems, such as the learned iterative shrinkage and thresholding algorithm (ISTA) (Gregor & LeCun, 2010) and learned alternating direction method of multipliers (ADMM) (Boyd et al., 2011). In practice, these approaches outperform feed-forward neural networks with ReLU nonlinearities. These neural networks are usually obtained from algorithm unrolling (or unfolding) techniques, which were first proposed by Gregor and LeCun (Gregor & LeCun, 2010) to connect iterative algorithms to neural network architectures. The trained networks can potentially shed light on the problem being solved. For ISTA networks, each layer represents an iteration of a gradient-descent procedure. As a result, the output of each layer is a valid reconstruction of the target vector, and we expect the reconstructions to improve with the network's depth. These networks capture the structure of the original problem, which in practice translates to a smaller amount of required training data (Monga et al., 2021). Moreover, the generalization abilities of model-based networks tend to improve over those of regular feed-forward neural networks (Behboodi et al., 2020; Schnoor et al., 2021).

Understanding the generalization of deep learning algorithms has become an important open question. The generalization error of machine learning models measures the ability of a class of estimators to generalize from training samples to unseen samples, and to avoid overfitting the training set (Jakubovitz et al., 2019). Surprisingly, various deep neural networks exhibit high generalization abilities, even as the networks' complexity increases (Neyshabur et al., 2015b; Belkin et al., 2019). Classical machine learning measures, such as the Vapnik-Chervonenkis (VC) dimension (Vapnik & Chervonenkis, 1991) and the Rademacher complexity (RC) (Bartlett & Mendelson, 2002), predict an increasing generalization error (GE) with the increase of the models' complexity, and fail to explain the improved generalization observed in experiments. More advanced measures, which consider the training process and result in tighter bounds on the estimation error (EE), were proposed to investigate this gap, such as the local Rademacher complexity (LRC) (Bartlett et al., 2005). To the best of our knowledge, the EE of model-based networks has not been investigated to date using these complexity measures.
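To make the unrolling picture concrete, the following minimal sketch, in Python with NumPy, shows one possible form of a learned ISTA network; the names soft_threshold, lista_forward, W1, W2, lam, and depth are illustrative assumptions rather than a specific published implementation. Each layer applies a learned linear step followed by soft-thresholding, so that every layer's output is itself a sparse estimate of the target vector.

```python
import numpy as np

def soft_threshold(v, lam):
    """Elementwise soft-thresholding, the proximal operator of lam * ||.||_1."""
    return np.sign(v) * np.maximum(np.abs(v) - lam, 0.0)

def lista_forward(y, W1, W2, lam, depth):
    """Unrolled (learned) ISTA: each layer refines the sparse estimate x.

    y     : measurement vector of length m
    W1    : learned (n x m) matrix applied to the measurements
    W2    : learned (n x n) matrix applied to the previous estimate
    lam   : soft-threshold value
    depth : number of unrolled iterations, i.e. network layers
    """
    x = np.zeros(W1.shape[0])
    per_layer_estimates = []
    for _ in range(depth):
        x = soft_threshold(W1 @ y + W2 @ x, lam)
        per_layer_estimates.append(x)  # each layer's output is a valid reconstruction
    return x, per_layer_estimates
```

In classical ISTA the two matrices are fixed by the sensing model, whereas in the learned variant W1, W2 (and possibly lam) are trained from data (Gregor & LeCun, 2010).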

1.1. OUR CONTRIBUTIONS

In this work, we leverage existing complexity measures such as the RC and LRC in order to bound the generalization and estimation errors of learned ISTA and learned ADMM networks.

• We provide new bounds on the GE of ISTA and ADMM networks, showing that the GE of model-based networks is lower than that of common ReLU networks. The derivation of the theoretical guarantees combines existing proof techniques for computing the generalization error of multilayer networks with a new methodology for bounding the RC of the soft-thresholding operator, which allows a better understanding of the generalization ability of model-based networks.

• The obtained bounds translate to practical design rules for model-based networks which guarantee high generalization. In particular, we show that a nonincreasing GE as a function of the network's depth is achievable by limiting the weights' norm in the network (see the sketch after this list). This improves over existing bounds, which exhibit a logarithmic increase of the GE with depth (Schnoor et al., 2021). The GE bounds of the model-based networks suggest that, under similar restrictions, learned ISTA networks generalize better than learned ADMM networks.

• We also exploit the LRC machinery to derive bounds on the EE of feed-forward networks, such as ReLU, ISTA, and ADMM networks. The EE bounds depend on the data distribution and the training loss. We show that the model-based networks achieve lower EE bounds compared to ReLU networks.

• We focus on the differences between ISTA and ReLU networks in terms of performance and generalization. This is done through a series of experiments for sparse vector recovery problems. The experiments indicate that the generalization abilities of ISTA networks are controlled by the soft-threshold value. For a proper choice of parameters, ISTA achieves a lower EE along with more accurate recovery. The dependency of the EE on λ and the number of training samples can be explained by the derived EE bounds.
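As a concrete illustration of the design rule in the second bullet, the sketch below shows one way to limit the weights' norm during training: after each optimizer step, every layer's weight matrix is rescaled so that its norm does not exceed a prescribed constant. The helper names, the use of the Frobenius norm, and the single shared bound max_norm are assumptions made for illustration; the precise constraint that yields a nonincreasing GE with depth follows from the bounds derived later in the paper.

```python
import numpy as np

def project_to_norm_ball(W, max_norm):
    """Rescale W so that its Frobenius norm does not exceed max_norm."""
    norm = np.linalg.norm(W, ord='fro')
    return W if norm <= max_norm else W * (max_norm / norm)

def constrain_unrolled_weights(weights, max_norm):
    """Apply the norm constraint to every layer's weight matrix,
    e.g. once after each training step."""
    return [project_to_norm_ball(W, max_norm) for W in weights]
```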
1.2. RELATED WORK

Understanding the GE and EE of general deep learning algorithms is an active area of research. A few approaches were proposed, which include considering networks of weight matrices with bounded norms (including spectral and L2,1 norms) (Bartlett et al., 2017; Sokolić et al., 2017), and analyzing the effect of multiple regularizations employed in deep learning, such as weight decay, early stopping, or drop-outs, on the generalization abilities (Neyshabur et al., 2015a; Gao & Zhou, 2016; Amjad et al., 2021). Additional works consider global properties of the networks, such as a bound on the product of all Frobenius norms of the weight matrices in the network (Golowich et al., 2018). However, these available bounds do not capture the GE behaviour as a function of network depth, where an increase in depth typically results in improved generalization. This also applies to the bounds on the GE of ReLU networks detailed in Section 2.3. Recently, a few works focused on bounding the GE specifically for deep iterative recovery algorithms (Behboodi et al., 2020; Schnoor et al., 2021). They focus on a broad class of unfolded networks for sparse recovery, and provide bounds which scale logarithmically with the number of layers (Schnoor et al., 2021). However, these bounds still do not capture the behaviours experienced in practice. Much work has also focused on incorporating the networks' training process into the bounds. The LRC framework due to Bartlett, Bousquet, and Mendelson (Bartlett et al., 2005) assumes that the training process results in a smaller class of estimation functions, such that the distance between the estimator in the class and the empirical risk minimizer (ERM) is bounded. An additional related framework is the effective dimensionality due to Zhang (Zhang, 2002). These frameworks result in tighter bounds on the EE.
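For intuition on why such global, norm-based bounds do not improve with depth, the short sketch below computes the product of per-layer Frobenius norms on which bounds of this type depend; the random layer sizes and scaling are illustrative assumptions. Whenever the individual norms exceed one, the product, and therefore any bound proportional to it, grows with the number of layers.

```python
import numpy as np

def frobenius_norm_product(weights):
    """Product of per-layer Frobenius norms, the global quantity controlling
    norm-based generalization bounds."""
    return float(np.prod([np.linalg.norm(W, ord='fro') for W in weights]))

# Illustrative only: random layers whose individual Frobenius norms are roughly 2,
# so the product grows geometrically as more layers are stacked.
rng = np.random.default_rng(0)
for depth in (2, 4, 8):
    layers = [0.2 * rng.standard_normal((10, 10)) for _ in range(depth)]
    print(depth, frobenius_norm_product(layers))
```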

Funding

This project has received funding from the European Research Council (ERC) under the European Union's Horizon 2020 research and innovation programme (grant agreement No. 101000967), and from the Israel Science Foundation under Grant 536/22. Y. C. Eldar and M. R. D. Rodrigues are supported by The Weizmann-UK Making Connections Programme (Ref. 129589). M. R. D. Rodrigues is also supported by the Alan Turing Institute. The authors wish to thank Dr. Gholamali Aminian from the Alan Turing Institute, UK, for his contribution to the correctness of the proofs.

