CONNECTION- AND NODE-SPARSE DEEP LEARNING: STATISTICAL GUARANTEES

Abstract

Neural networks are becoming increasingly popular in applications, but a comprehensive mathematical understanding of their potentials and limitations is still missing. In this paper, we study the prediction accuracies of neural networks from a statistical point of view. In particular, we establish statistical guarantees for deep learning with different types of sparsity-inducing regularization. Our bounds feature a mild dependence on network widths and depths, and, therefore, support the current trend toward wide and deep networks. The tools that we use in our derivations are uncommon in deep learning and, hence, might be of additional interest.

1. INTRODUCTION

Sparsity reduces network complexities and, consequently, lowers the demands on memory and computation, reduces overfitting, and improves interpretability (Changpinyo et al., 2017; Han et al., 2016; Kim et al., 2016; Liu et al., 2015; Wen et al., 2016). Three common notions of sparsity are connection sparsity, which means that there is only a small number of nonzero connections between nodes; node sparsity, which means that there is only a small number of active nodes (Alvarez & Salzmann, 2016; Changpinyo et al., 2017; Feng & Simon, 2017; Kim et al., 2016; Lee et al., 2008; Liu et al., 2015; Nie et al., 2015; Scardapane et al., 2017; Wen et al., 2016); and layer sparsity, which means that there is only a small number of active layers (Hebiri & Lederer, 2020). Approaches to achieving sparsity include augmenting small networks (Ash, 1989; Bello, 1992), pruning large networks (Simonyan & Zisserman, 2015; Han et al., 2016), constraint estimation (Ledent et al., 2019; Neyshabur et al., 2015; Schmidt-Hieber, 2020), and statistical regularization (Taheri et al., 2020).

The many empirical observations of the benefits of sparsity have sparked interest in mathematical support in the form of statistical theories. But such theories are still scarce and, in any case, have severe limitations. For example, statistical guarantees for deep learning with connection-sparse regularization have been established in Taheri et al. (2020), but they do not cover node sparsity, which, in view of the removal of entire nodes, has become especially popular. Moreover, their estimator involves an additional parameter, their theory is limited to a single output node, and their results have a suboptimal dependence on the input vectors. Statistical guarantees for constraint estimation over connection- and node-sparse networks follow from combining results in Neyshabur et al. (2015) and Bartlett & Mendelson (2002).
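To make the first two notions concrete, the following minimal sketch (the weight matrix and all numbers are our own hypothetical illustration, not from the paper) shows how, for a single layer's weight matrix, connection sparsity counts nonzero weights, whereas node sparsity counts nodes that have at least one nonzero incoming weight:

```python
import numpy as np

# Hypothetical 4x5 weight matrix of one layer: rows index the layer's
# output nodes, columns index its input nodes.
W = np.array([
    [0.0, 1.2, 0.0, 0.0, -0.3],
    [0.0, 0.0, 0.0, 0.0,  0.0],   # inactive node: all incoming weights are zero
    [0.5, 0.0, 0.0, 0.0,  0.0],
    [0.0, 0.0, 0.0, 0.0,  0.0],   # inactive node
])

# Connection sparsity: few nonzero connections between nodes.
num_connections = int(np.count_nonzero(W))               # 3 of 20 possible

# Node sparsity: few active nodes (rows with at least one nonzero weight).
num_active_nodes = int(np.sum(np.any(W != 0, axis=1)))   # 2 of 4 nodes
```

A connection-sparse regularizer shrinks individual entries of `W` to zero, while a node-sparse regularizer zeroes out entire rows at once, removing the corresponding nodes.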
But for computational and practical reasons, regularized estimation is typically preferred over constraint estimation in deep learning as well as in machine learning at large (Hastie et al., 2015). Moreover, the resulting theory is limited to a single output node and ReLU activation, scales exponentially in the number of layers, and requires bounded loss functions. Statistical prediction guarantees for constraint estimation over connection-sparse networks have been derived in Schmidt-Hieber (2020), but that theory is limited to a single output node and ReLU activation and assumes bounded weights. In short, the existing statistical theory for deep learning with connection and node sparsity is still deficient.

The goal of this paper is to provide an improved theory for sparse deep learning. We focus on regression-type settings with layered, feedforward neural networks. The estimators under consideration consist of a standard least-squares estimator with additional regularizers that induce connection or node sparsity. We then derive our guarantees by using techniques from high-dimensional statistics (Dalalyan et al., 2017) and empirical process theory (van de Geer, 2000). In the case of subgaussian noise, we find the rates $l\sqrt{(\log[mnp])^3/n}$ and $ml\bar{p}\sqrt{(\log[mnp])^3/n}$ for the connection-sparse and node-sparse estimators, respectively, where $l$ is the number of hidden layers, $m$ the number of output nodes, $n$ the number of samples, $p$ the total number of parameters, and $\bar{p}$ the maximal width of the network. The rates suggest that sparsity-inducing approaches can provide accurate prediction even in very wide (with connection sparsity) and very deep (with either type of sparsity) networks while, at the same time, ensuring low network complexities. These findings underpin the current trend toward sparse but wide and especially deep networks from a statistical perspective.
Outline of the paper Section 2 recapitulates the notions of connection and node sparsity and introduces the corresponding deep learning framework and estimators. Section 3 confirms the empirically observed accuracies of connection- and node-sparse estimation in theory. Section 4 summarizes the key features and limitations of our work. The Appendix contains all proofs.

2. CONNECTION- AND NODE-SPARSE DEEP LEARNING

We consider data $(y_1, x_1), \ldots, (y_n, x_n) \in \mathbb{R}^m \times \mathbb{R}^d$ that are related via $y_i = g^*[x_i] + u_i$ for $i \in \{1, \ldots, n\}$ for an unknown data-generating function $g^* : \mathbb{R}^d \to \mathbb{R}^m$ and unknown, random noise $u_1, \ldots, u_n \in \mathbb{R}^m$. We allow all aspects, namely $y_i$, $g^*$, $x_i$, and $u_i$, to be unbounded. Our goal is to model the data-generating function with a feedforward neural network of the form $g_\Theta[x] := \Theta_l f_l[\Theta_{l-1} \cdots f_1[\Theta_0 x]]$ for $x \in \mathbb{R}^d$, indexed by the parameter space $\mathcal{M} := \{\Theta = (\Theta_l, \ldots, \Theta_0) : \Theta_j \in \mathbb{R}^{p_{j+1} \times p_j}\}$. The functions $f_j : \mathbb{R}^{p_j} \to \mathbb{R}^{p_j}$ are called the activation functions, and $p_0 := d$ and $p_{l+1} := m$ are called the input and output dimensions, respectively. The depth of the network is $l$, the maximal width is $\bar{p} := \max_{j \in \{0, \ldots, l-1\}} p_{j+1}$, and the total number of parameters is $p := \sum_{j=0}^{l} p_{j+1} p_j$. In practice, the total number of parameters often rivals or exceeds the number of samples: $p \approx n$ or $p \gg n$. We then speak of high dimensionality. A common technique for avoiding overfitting in high-dimensional settings is regularization that induces additional structures, such as sparsity. Sparsity has the interesting side effect of reducing the networks' complexities, which can facilitate interpretations and reduce demands on energy and memory.

Our first sparse estimator is
$$\widehat{\Theta}^{\mathrm{con}} \in \operatorname*{arg\,min}_{\Theta \in \mathcal{M}_1} \sum_{i=1}^{n} \|y_i - g_\Theta[x_i]\|_2^2 + r_{\mathrm{con}} |||\Theta_l|||_1 \qquad (3)$$
for a tuning parameter $r_{\mathrm{con}} \in [0, \infty)$, a nonempty set of parameters $\mathcal{M}_1 \subset \{\Theta \in \mathcal{M} : \max_{j \in \{0, \ldots, l-1\}} |||\Theta_j|||_1 \le 1\}$, and the $\ell_1$-norm $|||\Theta_j|||_1 := \sum_{i=1}^{p_{j+1}} \sum_{k=1}^{p_j} |(\Theta_j)_{ik}|$ for $j \in \{0, \ldots, l\}$, $\Theta_j \in \mathbb{R}^{p_{j+1} \times p_j}$. This estimator is an analog of the lasso estimator in linear regression (Tibshirani, 1996). It induces sparsity on the level of connections: the larger the tuning parameter $r_{\mathrm{con}}$, the fewer connections among the nodes. Deep learning with $\ell_1$-regularization has become common in theory and practice (Kim et al., 2016; Taheri et al., 2020).
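To make the setup concrete, here is a minimal NumPy sketch of the network $g_\Theta[x] = \Theta_l f_l[\Theta_{l-1} \cdots f_1[\Theta_0 x]]$ and of the connection-sparse objective in (3). The function names are our own, and the choice of ReLU activations is only one example of the $f_j$, not prescribed by the paper:

```python
import numpy as np

def forward(thetas, x):
    """Evaluate g_Theta[x] for thetas = [Theta_0, ..., Theta_l].

    The activations f_1, ..., f_l (here: ReLU, as an example) act after
    Theta_0, ..., Theta_{l-1}; the output layer Theta_l has no activation.
    """
    z = x
    for theta in thetas[:-1]:
        z = np.maximum(theta @ z, 0.0)
    return thetas[-1] @ z

def l1_norm(theta):
    """Entrywise l1-norm |||Theta_j|||_1 = sum over i, k of |(Theta_j)_{ik}|."""
    return float(np.abs(theta).sum())

def objective(thetas, xs, ys, r_con):
    """Least-squares loss plus r_con * |||Theta_l|||_1, as in (3).

    The constraint max_j |||Theta_j|||_1 <= 1 on the inner layers
    (the set M_1) is not enforced here; only the objective is shown.
    """
    loss = sum(np.sum((y - forward(thetas, x)) ** 2) for x, y in zip(xs, ys))
    return loss + r_con * l1_norm(thetas[-1])
```

With $r_{\mathrm{con}} = 0$ the objective reduces to the plain least-squares loss; increasing $r_{\mathrm{con}}$ drives entries of the output-layer matrix $\Theta_l$ toward zero, that is, toward fewer connections.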
Our estimator (3) specifies one way to formulate this type of regularization. The estimator is indeed a regularized estimator (rather than a constraint estimator), because the complexity

