CONNECTION- AND NODE-SPARSE DEEP LEARNING: STATISTICAL GUARANTEES

Abstract

Neural networks are becoming increasingly popular in applications, but a comprehensive mathematical understanding of their potential and limitations is still missing. In this paper, we study the prediction accuracy of neural networks from a statistical point of view. In particular, we establish statistical guarantees for deep learning with different types of sparsity-inducing regularization. Our bounds feature a mild dependence on network widths and depths and, therefore, support the current trend toward wide and deep networks. The tools that we use in our derivations are uncommon in deep learning and, hence, might be of additional interest.

1. INTRODUCTION

Sparsity reduces network complexities and, consequently, lowers the demands on memory and computation, reduces overfitting, and improves interpretability (Changpinyo et al., 2017; Han et al., 2016; Kim et al., 2016; Liu et al., 2015; Wen et al., 2016). Three common notions of sparsity are connection sparsity, which means that there is only a small number of nonzero connections between nodes, node sparsity, which means that there is only a small number of active nodes (Alvarez & Salzmann, 2016; Changpinyo et al., 2017; Feng & Simon, 2017; Kim et al., 2016; Lee et al., 2008; Liu et al., 2015; Nie et al., 2015; Scardapane et al., 2017; Wen et al., 2016), and layer sparsity, which means that there is only a small number of active layers (Hebiri & Lederer, 2020). Approaches to achieving sparsity include augmenting small networks (Ash, 1989; Bello, 1992), pruning large networks (Simonyan & Zisserman, 2015; Han et al., 2016), constraint estimation (Ledent et al., 2019; Neyshabur et al., 2015; Schmidt-Hieber, 2020), and statistical regularization (Taheri et al., 2020).

The many empirical observations of the benefits of sparsity have sparked interest in mathematical support in the form of statistical theories. But such theories are still scarce and, in any case, have severe limitations. For example, statistical guarantees for deep learning with connection-sparse regularization have been established in Taheri et al. (2020), but they do not cover node sparsity, which, because it removes entire nodes, has become especially popular. Moreover, their estimator involves an additional parameter, their theory is limited to a single output node, and their results have a suboptimal dependence on the input vectors. Statistical guarantees for constraint estimation over connection- and node-sparse networks follow from combining results in Neyshabur et al. (2015) and Bartlett & Mendelson (2002). But for computational and practical reasons, regularized estimation is typically preferred over constraint estimation in deep learning as well as in machine learning at large (Hastie et al., 2015). Moreover, their theory is limited to a single output node and ReLU activation, scales exponentially in the number of layers, and requires bounded loss functions. Statistical prediction guarantees for constraint estimation over connection-sparse networks have been derived in Schmidt-Hieber (2020), but their theory is limited to a single output node and ReLU activation and assumes bounded weights. In short, the existing statistical theory for deep learning with connection and node sparsity is still deficient.

The goal of this paper is to provide an improved theory for sparse deep learning. We focus on regression-type settings with layered, feedforward neural networks. The estimators under consideration combine a standard least-squares objective with regularizers that induce connection or node sparsity; a schematic version of such estimators is sketched below. We then derive our guarantees by using techniques from high-dimensional statistics (Dalalyan et al., 2017) and empirical process theory (van de Geer, 2000). In the case of
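To fix ideas, the following display sketches one standard way to write such regularized least-squares estimators; the notation ($f_W$, $\lambda$, $\Omega$) is ours and purely illustrative, since the precise definitions of the networks and regularizers used in our results are given in the later sections.

$$\widehat{W}_\lambda \in \operatorname*{arg\,min}_{W} \Bigg\{ \sum_{i=1}^{n} \big\| y_i - f_W(x_i) \big\|_2^2 \;+\; \lambda\,\Omega(W) \Bigg\},$$

where $f_W$ is a layered, feedforward network with weight matrices $W = (W^1, \dots, W^L)$, $\lambda \ge 0$ is a tuning parameter, and $\Omega$ is a sparsity-inducing regularizer, for example

$$\Omega_{\mathrm{conn}}(W) \;=\; \sum_{l=1}^{L} \sum_{j,k} \big| W^l_{jk} \big| \quad \text{(connection sparsity)} \qquad \text{or} \qquad \Omega_{\mathrm{node}}(W) \;=\; \sum_{l=1}^{L} \sum_{j} \big\| W^l_{j\cdot} \big\|_2 \quad \text{(node sparsity)},$$

with $W^l_{j\cdot}$ the vector of connections attached to the $j$th node of layer $l$. The $\ell_1$-type penalty sets individual connections to zero, whereas the grouped $\ell_2$-type penalty sets all connections of a node to zero simultaneously and, therefore, removes entire nodes.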

