CONNECTION- AND NODE-SPARSE DEEP LEARNING: STATISTICAL GUARANTEES

Abstract

Neural networks are becoming increasingly popular in applications, but a comprehensive mathematical understanding of their potential and limitations is still missing. In this paper, we study the prediction accuracies of neural networks from a statistical point of view. In particular, we establish statistical guarantees for deep learning with different types of sparsity-inducing regularization. Our bounds feature a mild dependence on network widths and depths and, therefore, support the current trend toward wide and deep networks. The tools that we use in our derivations are uncommon in deep learning and, hence, might be of additional interest.

1. INTRODUCTION

Sparsity reduces network complexities and, consequently, lowers the demands on memory and computation, reduces overfitting, and improves interpretability (Changpinyo et al., 2017; Han et al., 2016; Kim et al., 2016; Liu et al., 2015; Wen et al., 2016). Three common notions of sparsity are connection sparsity, which means that there is only a small number of nonzero connections between nodes, node sparsity, which means that there is only a small number of active nodes (Alvarez & Salzmann, 2016; Changpinyo et al., 2017; Feng & Simon, 2017; Kim et al., 2016; Lee et al., 2008; Liu et al., 2015; Nie et al., 2015; Scardapane et al., 2017; Wen et al., 2016), and layer sparsity, which means that there is only a small number of active layers (Hebiri & Lederer, 2020). Approaches to achieving sparsity include augmenting small networks (Ash, 1989; Bello, 1992), pruning large networks (Simonyan & Zisserman, 2015; Han et al., 2016), constraint estimation (Ledent et al., 2019; Neyshabur et al., 2015; Schmidt-Hieber, 2020), and statistical regularization (Taheri et al., 2020). The many empirical observations of the benefits of sparsity have sparked interest in mathematical support in the form of statistical theories. But such theories are still scarce and, in any case, have severe limitations. For example, statistical guarantees for deep learning with connection-sparse regularization have been established in Taheri et al. (2020), but they do not cover node sparsity, which, in view of the removal of entire nodes, has become especially popular. Moreover, their estimator involves an additional parameter, their theory is limited to a single output node, and their results have a suboptimal dependence on the input vectors. Statistical guarantees for constraint estimation over connection- and node-sparse networks follow from combining results in Neyshabur et al. (2015) and Bartlett & Mendelson (2002).
But for computational and practical reasons, regularized estimation is typically preferred over constraint estimation in deep learning as well as in machine learning at large (Hastie et al., 2015). Moreover, the theory obtained in this way is limited to a single output node and ReLU activation, scales exponentially in the number of layers, and requires bounded loss functions. Statistical prediction guarantees for constraint estimation over connection-sparse networks have been derived in Schmidt-Hieber (2020), but their theory is limited to a single output node and ReLU activation and assumes bounded weights. In short, the existing statistical theory for deep learning with connection and node sparsity is still deficient. The goal of this paper is to provide an improved theory for sparse deep learning. We focus on regression-type settings with layered, feedforward neural networks. The estimators under consideration consist of a standard least-squares estimator with additional regularizers that induce connection or node sparsity. We then derive our guarantees by using techniques from high-dimensional statistics (Dalalyan et al., 2017) and empirical process theory (van de Geer, 2000). In the case of subgaussian noise, we find the rates $\sqrt{l(\log[mnp])^3/n}$ and $\sqrt{m l\, p_{\max} (\log[mnp])^3/n}$ for the connection-sparse and node-sparse estimators, respectively, where $l$ is the number of hidden layers, $m$ the number of output nodes, $n$ the number of samples, $p$ the total number of parameters, and $p_{\max}$ the maximal width of the network. The rates suggest that sparsity-inducing approaches can provide accurate prediction even in very wide (with connection sparsity) and very deep (with either type of sparsity) networks while, at the same time, ensuring low network complexities. These findings underpin the current trend toward sparse but wide and especially deep networks from a statistical perspective.
Outline of the paper Section 2 recapitulates the notions of connection and node sparsity and introduces the corresponding deep learning framework and estimators. Section 3 confirms the empirically-observed accuracies of connection- and node-sparse estimation in theory. Section 4 summarizes the key features and limitations of our work. The Appendix contains all proofs.

2. CONNECTION- AND NODE-SPARSE DEEP LEARNING

We consider data $(y_1, x_1), \dots, (y_n, x_n) \in \mathbb{R}^m \times \mathbb{R}^d$ that are related via

$y_i = g^*[x_i] + u_i$ for $i \in \{1, \dots, n\}$ (1)

for an unknown data-generating function $g^* : \mathbb{R}^d \to \mathbb{R}^m$ and unknown, random noise vectors $u_1, \dots, u_n \in \mathbb{R}^m$. We allow all aspects, namely $y_i$, $g^*$, $x_i$, and $u_i$, to be unbounded. Our goal is to model the data-generating function with a feedforward neural network of the form

$g_\Theta[x] := \Theta_l f_l[\Theta_{l-1} \cdots f_1[\Theta_0 x]]$ for $x \in \mathbb{R}^d$ (2)

indexed by the parameter space $\mathcal{M} := \{\Theta = (\Theta_l, \dots, \Theta_0) : \Theta_j \in \mathbb{R}^{p_{j+1} \times p_j}\}$. The functions $f_j : \mathbb{R}^{p_j} \to \mathbb{R}^{p_j}$ are called the activation functions, and $p_0 := d$ and $p_{l+1} := m$ are called the input and output dimensions, respectively. The depth of the network is $l$, the maximal width is $p_{\max} := \max_{j \in \{0, \dots, l-1\}} p_{j+1}$, and the total number of parameters is $p := \sum_{j=0}^{l} p_{j+1} p_j$. In practice, the total number of parameters often rivals or exceeds the number of samples: $p \approx n$ or $p \gg n$. We then speak of high dimensionality. A common technique for avoiding overfitting in high-dimensional settings is regularization that induces additional structures, such as sparsity. Sparsity has the interesting side effect of reducing the networks' complexities, which can facilitate interpretations and reduce demands on energy and memory. Our first sparse estimator is

$\widehat{\Theta}_{\mathrm{con}} \in \operatorname{arg\,min}_{\Theta \in \mathcal{M}_1} \{\sum_{i=1}^n \|y_i - g_\Theta[x_i]\|_2^2 + r_{\mathrm{con}} |||\Theta_l|||_1\}$ (3)

for a tuning parameter $r_{\mathrm{con}} \in [0, \infty)$, a nonempty set of parameters $\mathcal{M}_1 \subset \{\Theta \in \mathcal{M} : \max_{j \in \{0, \dots, l-1\}} |||\Theta_j|||_1 \le 1\}$, and the $\ell_1$-norm $|||\Theta_j|||_1 := \sum_{i=1}^{p_{j+1}} \sum_{k=1}^{p_j} |(\Theta_j)_{ik}|$ for $j \in \{0, \dots, l\}$, $\Theta_j \in \mathbb{R}^{p_{j+1} \times p_j}$. This estimator is an analog of the lasso estimator in linear regression (Tibshirani, 1996). It induces sparsity on the level of connections: the larger the tuning parameter $r_{\mathrm{con}}$, the fewer connections among the nodes. Deep learning with $\ell_1$-regularization has become common in theory and practice (Kim et al., 2016; Taheri et al., 2020).
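To make the estimator concrete, the following sketch evaluates the network $g_\Theta$ and the objective of (3) for given weight matrices. It is a minimal illustration, not the authors' implementation: the function names (`forward`, `objective_con`) and the choice of ReLU are ours, and the constraint $\max_j |||\Theta_j|||_1 \le 1$ on the inner layers would have to be enforced separately (for example, by rescaling or projection), which the sketch omits.

```python
import numpy as np

def relu(z):
    """ReLU activation: 1-Lipschitz and nonnegative homogeneous."""
    return np.maximum(z, 0.0)

def forward(thetas, x):
    """Evaluate g_Theta[x] = Theta_l f_l[Theta_{l-1} ... f_1[Theta_0 x]];
    `thetas` lists the weight matrices (Theta_0, ..., Theta_l)."""
    z = x
    for theta in thetas[:-1]:
        z = relu(theta @ z)
    return thetas[-1] @ z  # the outermost layer is linear

def objective_con(thetas, xs, ys, r_con):
    """Least-squares loss plus the l1 penalty on the outermost layer, as in (3)."""
    loss = sum(np.sum((y - forward(thetas, x)) ** 2) for x, y in zip(xs, ys))
    return loss + r_con * np.abs(thetas[-1]).sum()
```

Increasing `r_con` shifts the minimizer toward networks whose outermost layer, and with it the whole normalized network, has small $\ell_1$ complexity.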
Our estimator (3) specifies one way to formulate this type of regularization. The estimator is indeed a regularized estimator (rather than a constraint estimator), because the complexity is regulated entirely through the tuning parameter $r_{\mathrm{con}}$ in the objective function (rather than through a tuning parameter in the set over which the objective function is optimized). But $\ell_1$-regularization could also be formulated slightly differently. For example, one could consider the estimators

$\widehat{\Theta}_{\mathrm{con}} \in \operatorname{arg\,min}_{\Theta \in \mathcal{M}} \{\sum_{i=1}^n \|y_i - g_\Theta[x_i]\|_2^2 + r_{\mathrm{con}} \sum_{j=0}^{l} |||\Theta_j|||_1\}$ (4)

or

$\widehat{\Theta}_{\mathrm{con}} \in \operatorname{arg\,min}_{\Theta \in \mathcal{M}} \{\sum_{i=1}^n \|y_i - g_\Theta[x_i]\|_2^2 + r_{\mathrm{con}} \prod_{j=0}^{l} |||\Theta_j|||_1\}$. (5)

The differences among the estimators (3)-(5) are small: for example, our theory can be adjusted for (4) with almost no changes of the derivations. The differences among the estimators mainly concern the normalizations of the parameters; we illustrate this in the following proposition.

Proposition 1 (Scaling of Norms). Assume that the all-zeros parameter $(0_{p_{l+1} \times p_l}, \dots, 0_{p_1 \times p_0}) \in \mathcal{M}_1$ is neither a solution of (3) nor of (5), that $r_{\mathrm{con}} > 0$, and that the activation functions are nonnegative homogeneous: $f_j[ab] = a f_j[b]$ for all $j \in \{1, \dots, l\}$, $a \in [0, \infty)$, and $b \in \mathbb{R}^{p_j}$. Then, $|||(\widehat{\Theta}_{\mathrm{con}})_0|||_1 = \dots = |||(\widehat{\Theta}_{\mathrm{con}})_{l-1}|||_1 = 1$ (concerns the inner layers) for all solutions of (3), while $|||(\widehat{\Theta}_{\mathrm{con}})_0|||_1 = \dots = |||(\widehat{\Theta}_{\mathrm{con}})_l|||_1$ (concerns all layers) for at least one solution of (5).

Another way to formulate $\ell_1$-regularization was proposed in Taheri et al. (2020): they reparametrize the networks through a scale parameter and a constraint version of $\mathcal{M}$ and then focus the regularization on the scale parameter only. Our above-stated estimator (3) is more elegant in that it avoids the reparametrization and the additional parameter.
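The nonnegative homogeneity assumed in Proposition 1 can be checked numerically: with ReLU activation, multiplying one inner layer by $a > 0$ and dividing another by $a$ leaves the network function unchanged, which is exactly the rescaling freedom that the proposition exploits. The small script below is an illustrative sketch with arbitrary random weights.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def forward(thetas, x):
    z = x
    for theta in thetas[:-1]:
        z = relu(theta @ z)
    return thetas[-1] @ z

rng = np.random.default_rng(0)
thetas = [rng.standard_normal((4, 3)),
          rng.standard_normal((4, 4)),
          rng.standard_normal((2, 4))]
x = rng.standard_normal(3)

# Multiply one inner layer by a > 0 and divide another by a.
# Nonnegative homogeneity of ReLU (f[a b] = a f[b] for a >= 0)
# implies that the network output is unchanged.
a = 3.7
rescaled = [a * thetas[0], thetas[1] / a, thetas[2]]
assert np.allclose(forward(thetas, x), forward(rescaled, x))
```

This invariance is why the norms of the individual layers are only pinned down by the regularization, as described in Proposition 1.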
The factor $|||\Theta_l|||_1$ in the regularization term of (3) measures the complexity of the network over the set $\mathcal{M}_1$, and the factor $r_{\mathrm{con}}$ regulates the complexity of the resulting estimator. This provides a convenient lever for data-adaptive complexity regularization through well-established calibration schemes for the tuning parameter, such as cross-validation. This practical aspect is an advantage of regularized formulations like ours as compared to constraint estimation over sets with a predefined complexity. The constraints in the set $\mathcal{M}_1$ of the estimator (3) can also retain the expressiveness of the full parameterization that corresponds to the set $\mathcal{M}$: for example, assuming again nonnegative-homogeneous activation, one can check that for every $\Gamma \in \mathcal{M}$, there is a $\Gamma' \in \{\Theta \in \mathcal{M} : \max_{j \in \{0, \dots, l-1\}} |||\Theta_j|||_1 \le 1\}$ such that $g_{\Gamma'} = g_\Gamma$; cf. Taheri et al. (2020, Proposition 1). In contrast, existing theories on neural networks often require the parameter space to be bounded, which limits the expressiveness of the networks. Our regularization approach is, therefore, closer to practical setups than constraint approaches. The unboundedness, however, means that our derivations cannot rely on standard arguments such as (1989, Lemma (3.3)) (because that would require a bounded loss). We instead invoke ideas from high-dimensional statistics, prove Lipschitz properties for neural networks, and use empirical process theory that is based on chaining (see the Appendix).

Our second estimator is

$\widehat{\Theta}_{\mathrm{node}} \in \operatorname{arg\,min}_{\Theta \in \mathcal{M}_{2,1}} \{\sum_{i=1}^n \|y_i - g_\Theta[x_i]\|_2^2 + r_{\mathrm{node}} |||\Theta_l|||_{2,1}\}$ (6)

for a tuning parameter $r_{\mathrm{node}} \in [0, \infty)$, a nonempty set of parameters $\mathcal{M}_{2,1} \subset \{\Theta \in \mathcal{M} : \max_{j \in \{0, \dots, l-1\}} |||\Theta_j|||_{2,1} \le 1\}$, and the $\ell_2/\ell_1$-norm $|||\Theta_j|||_{2,1} := \sum_{k=1}^{p_j} \sqrt{\sum_{i=1}^{p_{j+1}} |(\Theta_j)_{ik}|^2}$ for $j \in \{0, \dots, l\}$, $\Theta_j \in \mathbb{R}^{p_{j+1} \times p_j}$. This estimator is an analog of the group-lasso estimator in linear regression (Bakin, 1999).
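For intuition on the regularizer in (6), the following snippet computes the $\ell_2/\ell_1$-norm of a weight matrix as the sum of its column-wise Euclidean norms; the helper name `norm_21` is ours. Since each column collects the outgoing weights of one node, a group penalty of this form drives entire columns, and hence entire nodes, to zero.

```python
import numpy as np

def norm_21(theta):
    """l2/l1 norm: sum over columns of the column-wise Euclidean norms.

    Column k of Theta_j holds the outgoing weights of node k of layer j,
    so a zero column corresponds to an inactive node."""
    return np.linalg.norm(theta, axis=0).sum()

# A matrix whose second column is zero: that node is inactive.
theta = np.array([[3.0, 0.0, 0.0],
                  [4.0, 0.0, 1.0]])
print(norm_21(theta))  # column norms are 5, 0, 1 -> 6.0
```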
Again, to avoid ambiguities in the regularization, our formulation is slightly different from the standard formulations in the literature, but the fact that group-lasso regularizers lead to node-sparse networks has been discussed extensively before (Alvarez & Salzmann, 2016; Liu et al., 2015; Scardapane et al., 2017): the larger the tuning parameter $r_{\mathrm{node}}$, the fewer active nodes in the network. The above-stated comments about the specific form of the connection-sparse estimator also apply to the node-sparse estimator. An illustration of connection and node sparsity is given in Figure 1. Connection-sparse networks have only a small number of active connections between nodes (left panel of Figure 1); node-sparse networks have inactive nodes, that is, completely unconnected nodes (right panel of Figure 1). The two notions of sparsity are connected: for example, connection sparsity can render entire nodes inactive "by accident" (see the layer that follows the input layer in the left panel of the figure). In general, node sparsity is the weaker assumption, because it allows for highly connected nodes; this observation is reflected in the theoretical guarantees in the following section. The optimal network architecture for given data (such as the optimal width) is hardly known beforehand in a data analysis. A main feature of sparsity-inducing regularization is, therefore, that it adjusts parts of the network architecture to the data. In other words, sparsity-inducing regularization is a data-driven approach to adapting the complexity of the network. While versions of the estimators (3) and (6) are popular in deep learning, statistical analyses, especially of node-sparse deep learning, are scarce. Such a statistical analysis is, therefore, the goal of the following section.

3. STATISTICAL PREDICTION GUARANTEES

We now develop statistical guarantees for the sparse estimators described above. The guarantees are formulated in terms of the squared average (in-sample) prediction error

$\mathrm{err}[\Theta] := \frac{1}{n} \sum_{i=1}^n \|g^*[x_i] - g_\Theta[x_i]\|_2^2$ for $\Theta \in \mathcal{M}$,

which is a measure for how well the network $g_\Theta$ fits the unknown function $g^*$ (which does not need to be a neural network) on the data at hand, and in terms of the prediction risk (or generalization error) for a new sample $(y, x)$ that has the same distribution as the original data

$\mathrm{risk}[\Theta] := \mathbb{E}\|y - g_\Theta[x]\|_2^2$ for $\Theta \in \mathcal{M}$,

which measures how well the network $g_\Theta$ can predict a new sample. We first study the prediction error, because it is agnostic to the distribution of the input data; in the end, we then translate the bounds for the prediction error into bounds for the generalization error. We first observe that the networks in (2) can be somewhat "linearized": for every parameter $\Theta \in \mathcal{M}_1$, there is a parameter $\overline{\Theta} \in \overline{\mathcal{M}}_1 := \{\overline{\Theta} = (\Theta_{l-1}, \dots, \Theta_0) : \Theta_j \in \mathbb{R}^{p_{j+1} \times p_j},\ \max_{j \in \{0, \dots, l-1\}} |||\Theta_j|||_1 \le 1\}$ such that for every $x \in \mathbb{R}^d$

$g_\Theta[x] = \Theta_l\, \overline{g}_{\overline{\Theta}}[x]$ with $\overline{g}_{\overline{\Theta}}[x] := f_l[\Theta_{l-1} \cdots f_1[\Theta_0 x]] \in \mathbb{R}^{p_l}$.

This additional notation allows us to disentangle the outermost layer (which is regularized directly) from the other layers (which are regularized indirectly). More generally speaking, the additional notation makes a connection to linear regression, where the above holds trivially with $\overline{g}_{\overline{\Theta}}[x] = x$. We also define $\overline{\mathcal{M}}_{2,1} := \{\overline{\Theta} = (\Theta_{l-1}, \dots, \Theta_0) : \Theta_j \in \mathbb{R}^{p_{j+1} \times p_j},\ \max_{j \in \{0, \dots, l-1\}} |||\Theta_j|||_{2,1} \le 1\}$ accordingly. In high-dimensional linear regression, the quantity central to prediction guarantees is the effective noise (Lederer & Vogt, 2020). In our notation (with $l = 0$ and $m = 1$ to describe linear regression), the effective noise is $2\|\sum_{i=1}^n u_i x_i\|_\infty$.
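In the linear case ($l = 0$, $m = 1$), the effective noise $2\|\sum_{i=1}^n u_i x_i\|_\infty$ can be computed directly; the sketch below does so for simulated Gaussian data (the sample sizes and noise level are arbitrary choices of ours).

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 100, 20
xs = rng.standard_normal((n, d))   # fixed input vectors x_1, ..., x_n (one per row)
us = 0.5 * rng.standard_normal(n)  # noise for a single output (m = 1)

# Effective noise in the linear case (l = 0, m = 1): 2 * || sum_i u_i x_i ||_inf.
effective_noise = 2.0 * np.abs(us @ xs).max()
print(effective_noise)
```

In the deep case, the fixed vector $x_i$ is replaced by the inner-network output $\overline{g}_{\overline{\Psi}}[x_i]$ and a supremum over the inner parameters is taken, as formalized next.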
The above linearization allows us to generalize the effective noise to our general deep-learning framework:

$r^*_{\mathrm{con}} := 2 \sup_{\overline{\Psi} \in \overline{\mathcal{M}}_1} |||\sum_{i=1}^n u_i (\overline{g}_{\overline{\Psi}}[x_i])^\top|||_\infty$ and $r^*_{\mathrm{node}} := 2\sqrt{m} \sup_{\overline{\Psi} \in \overline{\mathcal{M}}_{2,1}} |||\sum_{i=1}^n u_i (\overline{g}_{\overline{\Psi}}[x_i])^\top|||_\infty$, (8)

where $|||A|||_\infty := \max_{(i,j) \in \{1, \dots, m\} \times \{1, \dots, p_l\}} |A_{ij}|$ for $A \in \mathbb{R}^{m \times p_l}$. The effective noises, as we will see below, are the optimal tuning parameters in our theories; at the same time, the effective noises depend on the noise random variables $u_1, \dots, u_n$, which are unknown in practice. Accordingly, we call the quantities $r^*_{\mathrm{con}}$ and $r^*_{\mathrm{node}}$ the oracle tuning parameters. We take a moment to compare the effective noises in (8) to Rademacher complexities (Koltchinskii, 2001; Koltchinskii & Panchenko, 2002). Rademacher complexities are the basis of a line of other statistical theories for deep learning (Bartlett & Mendelson, 2002; Golowich et al., 2017; Lederer, 2020a; Neyshabur et al., 2015). In our framework, the effective noises in the case $m = 1$ resemble empirical Rademacher complexities with the Rademacher variables replaced by the actual noise variables (Lederer, 2020a). We can now state a general prediction guarantee.

Theorem 1 (General Prediction Guarantees). If $r_{\mathrm{con}} \ge r^*_{\mathrm{con}}$, it holds that

$\mathrm{err}[\widehat{\Theta}_{\mathrm{con}}] \le \inf_{\Theta \in \mathcal{M}_1} \{\mathrm{err}[\Theta] + \frac{2 r_{\mathrm{con}}}{n} |||\Theta_l|||_1\}$.

Similarly, if $r_{\mathrm{node}} \ge r^*_{\mathrm{node}}$, it holds that

$\mathrm{err}[\widehat{\Theta}_{\mathrm{node}}] \le \inf_{\Theta \in \mathcal{M}_{2,1}} \{\mathrm{err}[\Theta] + \frac{2 r_{\mathrm{node}}}{n} |||\Theta_l|||_{2,1}\}$.

Each bound contains an approximation error $\mathrm{err}[\Theta]$ that captures how well the class of networks can approximate the true data-generating function $g^*$ and a statistical error proportional to $r_{\mathrm{con}}/n$ and $r_{\mathrm{node}}/n$, respectively, that captures how well the estimator can select within the class of networks at hand. In other words, Theorem 1 ensures that the estimators (3) and (6) predict, up to the statistical error described by $r_{\mathrm{con}}/n$ and $r_{\mathrm{node}}/n$, respectively, as well as the best connection- and node-sparse network. This observation can be illustrated further:

Corollary 1 (Parametric Setting).
If additionally $g^* = g_{\Theta^*}$ for a $\Theta^* \in \mathcal{M}_1$, it holds that

$\mathrm{err}[\widehat{\Theta}_{\mathrm{con}}] \le \frac{2 r_{\mathrm{con}}}{n} |||(\Theta^*)_l|||_1$.

If instead $g^* = g_{\Theta^*}$ for a $\Theta^* \in \mathcal{M}_{2,1}$, it holds that

$\mathrm{err}[\widehat{\Theta}_{\mathrm{node}}] \le \frac{2 r_{\mathrm{node}}}{n} |||(\Theta^*)_l|||_{2,1}$.

Hence, if the underlying data-generating function is a sparse network itself, the prediction errors of the estimators are essentially bounded by the statistical errors $r_{\mathrm{con}}/n$ and $r_{\mathrm{node}}/n$. The above-stated results also identify the oracle tuning parameters $r^*_{\mathrm{con}}$ and $r^*_{\mathrm{node}}$ as optimal tuning parameters: they give the best prediction guarantees in Theorem 1. But since the oracle tuning parameters are unknown in practice, the guarantees implicitly presume a calibration scheme that satisfies $r_{\mathrm{con}} \approx r^*_{\mathrm{con}}$ in practice. A natural candidate is cross-validation, but there are no guarantees that cross-validation provides such tuning parameters. This is a limitation that our theories share with all other theories in the field. Rather than dealing with the practical calibration of the tuning parameters, we exemplify the oracle tuning parameters in a specific setting. This analysis will illustrate the rates of convergence that we can expect from Theorem 1, and it will allow us to compare our theories with other theories in the literature. Assume that the activation functions satisfy $f_j[0_{p_j}] = 0_{p_j}$ and are 1-Lipschitz continuous with respect to the Euclidean norms on the functions' input and output spaces $\mathbb{R}^{p_j}$. A popular example is ReLU activation (Nair & Hinton, 2010), but the conditions are met by many other functions as well. Also, assume that the noise vectors $u_1, \dots, u_n$ are independent and centered and have uniformly subgaussian entries (van de Geer, 2000, Display (8.2) on Page 126). Keep the input vectors fixed and capture their normalizations by $v_\infty := \sqrt{\frac{1}{n} \sum_{i=1}^n \|x_i\|_\infty^2}$ and $v_2 := \sqrt{\frac{1}{n} \sum_{i=1}^n \|x_i\|_2^2}$. Then, we obtain the following bounds for the effective noises.

Proposition 2 (Subgaussian Noise).
There is a constant $c \in (0, \infty)$ that depends only on the subgaussian parameters of the noise such that

$\mathbb{P}\big(r^*_{\mathrm{con}} \le c\, v_\infty \sqrt{n l (\log[2mnp])^3}\big) \ge 1 - \frac{1}{n}$

and

$\mathbb{P}\big(r^*_{\mathrm{node}} \le c\, v_2 \sqrt{m n l\, p_{\max} (\log[2mnp])^3}\big) \ge 1 - \frac{1}{n}$.

Broadly speaking, this result combined with Theorem 1 illustrates that accurate prediction with connection- and node-sparse estimators is possible even when using very wide and deep networks. Let us analyze the factors one by one and compare them to the factors in the bounds of Taheri et al. (2020) and Neyshabur et al. (2015), which are the two most related papers. The connection-sparse case compares to the results in Taheri et al. (2020), and it compares to the results in Neyshabur et al. (2015) when setting the parameters in that paper to $p = q = 1$ (which gives a setting that is slightly more restrictive than ours) or $p = 1$; $q = \infty$ (which gives a setting that is slightly less restrictive than ours). The node-sparse case compares to Neyshabur et al. (2015) with $p = 2$; $q = \infty$ (which gives a setting that is more restrictive than ours, though). Our setup is also more general than the one in Neyshabur et al. (2015) in the sense that it allows for activation other than ReLU. The dependence on $n$ is, as usual, $1/\sqrt{n}$ up to logarithmic factors. In the connection-sparse case, our bounds involve the factor $v_\infty = \sqrt{\sum_{i=1}^n \|x_i\|_\infty^2 / n}$ rather than the factor $\overline{v}_\infty := \max_{i \in \{1, \dots, n\}} \|x_i\|_\infty$ of Neyshabur et al. (2015) or the factor $v_2 = \sqrt{\sum_{i=1}^n \|x_i\|_2^2 / n}$ of Taheri et al. (2020). In principle, the improvements of $v_\infty$ over $\overline{v}_\infty$ and $v_2$ can be up to a factor $\sqrt{n}$ and up to a factor $\sqrt{d}$, respectively; in practice, the improvements depend on the specifics of the data. For example, on the training data of MNIST (LeCun et al., 1998) and Fashion-MNIST (Xiao et al., 2017) ($\sqrt{n} \approx 250$; $\sqrt{d} = 28$ in both data sets), it holds that $v_\infty \approx \overline{v}_\infty \approx v_2/9$ and $v_\infty \approx \overline{v}_\infty \approx v_2/12$, respectively.
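The normalization factors appearing in Proposition 2 are cheap to compute for any fixed set of inputs. The sketch below evaluates $v_\infty$, $v_2$, and the competitor $\overline{v}_\infty$ for simulated inputs on a MNIST-like scale (the simulated data are our illustrative stand-in, not the actual data sets), and it checks the generally valid orderings $v_\infty \le \overline{v}_\infty$ and $v_\infty \le v_2$.

```python
import numpy as np

def normalization_factors(xs):
    """Input normalizations from Proposition 2; xs holds one sample per row."""
    v_inf = np.sqrt(np.mean(np.max(np.abs(xs), axis=1) ** 2))   # sqrt(mean ||x_i||_inf^2)
    v_2 = np.sqrt(np.mean(np.linalg.norm(xs, axis=1) ** 2))     # sqrt(mean ||x_i||_2^2)
    v_inf_bar = np.max(np.abs(xs))  # max_i ||x_i||_inf, used in constraint-based bounds
    return v_inf, v_2, v_inf_bar

rng = np.random.default_rng(2)
xs = rng.uniform(0.0, 1.0, size=(1000, 784))  # inputs on a MNIST-like scale
v_inf, v_2, v_inf_bar = normalization_factors(xs)
assert v_inf <= v_inf_bar + 1e-12 and v_inf <= v_2 + 1e-12
```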
In the node-sparse case, our bounds involve $v_2$, which is again somewhat smaller than the factor $\overline{v}_2 := \max_{i \in \{1, \dots, n\}} \|x_i\|_2$ in Neyshabur et al. (2015). The main difference between the bounds for the connection-sparse and node-sparse estimators is their dependence on the networks' maximal width $p_{\max}$. The bound for the connection-sparse estimator (3) depends on the width $p_{\max}$ only logarithmically (through the total number of parameters $p$), while the bound for the node-sparse estimator (6) depends on $p_{\max}$ sublinearly. The dependence in the connection-sparse case is the same as in Taheri et al. (2020), while Neyshabur et al. (2015) can avoid even that logarithmic dependence (and, therefore, allow for networks with infinite widths). The node-sparse case in Neyshabur et al. (2015) does not involve our sublinear dependence on the width, but this difference stems from the fact that they use a more restrictive version of the grouping (we take the maximum over each layer, while they take the maximum over each node), and our results can be readily adjusted to their notion of group sparsity. These observations indicate that node sparsity as formulated above is suitable for slim networks ($p_{\max} \ll n$) but should be strengthened or complemented with other notions of sparsity otherwise. To give a numeric example, the training data in MNIST (LeCun et al., 1998) and Fashion-MNIST (Xiao et al., 2017) comprise $n = 60\,000$ samples, which means that the width should be considerably smaller than 60 000 when using node sparsity alone. (Note that the input layer does not take part in $p_{\max}$, which means that $d$ could be larger.) For unconstrained estimation, one can expect a linear dependence of the error on the total number of parameters (Anthony & Bartlett, 1999). Our bounds for the sparse estimators, in contrast, only have a $\log[p]$ dependence on the total number of parameters. This difference illustrates the virtue of regularization in general, and the virtue of sparsity in particular.

Our bounds have only a $\sqrt{l}$ dependence on the depth. These dependencies considerably improve on the exponentially-increasing dependencies on the depth in Neyshabur et al. (2015) and, therefore, are particularly suited to describe deep network architectures. Replacing the conditions $\max_j |||\Theta_j|||_1 \le 1$ and $\max_j |||\Theta_j|||_{2,1} \le 1$ in the definitions of the connection-sparse and node-sparse estimators by the stricter conditions $\sum_j |||\Theta_j|||_1 \le 1$ and $\sum_j |||\Theta_j|||_{2,1} \le 1$, respectively (cf. Taheri et al. (2020) and our discussion in Section 2), the dependence on the depth can be improved further from $\sqrt{l}$ to $(2/l)^l \sqrt{l}$ (this only requires a simple adjustment of the last display in the proof of Proposition 4), which is exponentially decreasing in the depth. Our connection-sparse bounds have a mild $\log[m]$ dependence on the number of output nodes; the node-sparse bounds involve an additional factor $\sqrt{m}$. The case of multiple outputs has not been considered in statistical prediction bounds before. Proposition 2 also highlights another advantage of our regularization approach over theories such as Neyshabur et al. (2015) that apply to constraint estimators. The theories for constraint estimators require bounding the sparsity levels directly, but in practice, suitable values for these bounds are rarely known. In our framework, in contrast, the sparsity is controlled via tuning parameters indirectly, and Proposition 2, although not providing a complete practical calibration scheme, gives insights into how these tuning parameters should scale with $n$, $d$, $l$, and so forth. We also note that the bounds in Theorem 1 can be generalized readily to every estimator of the form

$\widehat{\Theta}_{\mathrm{gen}} \in \operatorname{arg\,min}_{\Theta \in \mathcal{M}_{\mathrm{gen}}} \{\sum_{i=1}^n \|y_i - g_\Theta[x_i]\|_2^2 + r_{\mathrm{gen}} |||\Theta_l|||\}$,

where $r_{\mathrm{gen}} \in [0, \infty)$ is a tuning parameter, $\mathcal{M}_{\mathrm{gen}}$ a suitable set of parameters, and $|||\cdot|||$ a norm. For example, one could impose connection sparsity on some layers and node sparsity on others, or one could impose different regularizations altogether. We omit the details to avoid digression.

We finally illustrate that the bounds for the prediction errors also entail bounds for the generalization errors. For simplicity, we consider a parametric setting and subgaussian noise again.

Proposition 3 (Generalization Guarantees). Assume that the inputs $x, x_1, \dots, x_n$ are i.i.d. random vectors, that the noise vectors $u_1, \dots, u_n$ are independent and centered and have uniformly subgaussian entries, and that $r^*_{\mathrm{con}}/n,\, r^*_{\mathrm{node}}/n \to 0$ as $n \to \infty$. Consider an arbitrary positive constant $b \in (0, \infty)$. If $g^* = g_{\Theta^*}$ for a $\Theta^* \in \mathcal{M}_1$ that is independent of the sample size $n$, it holds with probability at least $1 - 1/n$ that

$\mathrm{risk}[\widehat{\Theta}_{\mathrm{con}}] \le (1 + b)\, \mathrm{risk}[\Theta^*] + c\, v_\infty \sqrt{\frac{l (\log[2mnp])^3}{n}}\, |||(\Theta^*)_l|||_1$

for a constant $c \in (0, \infty)$ that depends only on $b$ and the subgaussian parameters of the noise. Similarly, if $g^* = g_{\Theta^*}$ for a $\Theta^* \in \mathcal{M}_{2,1}$ that is independent of the sample size $n$, it holds with probability at least $1 - 1/n$ that

$\mathrm{risk}[\widehat{\Theta}_{\mathrm{node}}] \le (1 + b)\, \mathrm{risk}[\Theta^*] + c\, v_2 \sqrt{\frac{m l\, p_{\max} (\log[2mnp])^3}{n}}\, |||(\Theta^*)_l|||_{2,1}$

for a constant $c \in (0, \infty)$ that depends only on $b$ and the subgaussian parameters of the noise. Hence, the generalization errors are bounded by the same terms as the prediction errors.

4. DISCUSSION

Our statistical theory for sparse deep learning incorporates node sparsity as well as connection sparsity, scales favorably in the number of layers, provides insights into how the tuning parameters should scale with the dimensions of the problem, and applies to unbounded loss functions. It is the first statistical theory that has all of these features; cf. Table 1. Additionally, we avoid the introduction of an additional scaling parameter and improve the dependence of the rates on the input data. Finally, our novel proof approach based on high-dimensional statistics and empirical-process theory is of independent interest. Evidence for the benefits of deep networks has been established in practice (LeCun et al., 2015; Schmidhuber, 2015), approximation theory (Liang & Srikant, 2016; Telgarsky, 2016; Yarotsky, 2017), and statistics (Golowich et al., 2017; Taheri et al., 2020). Since our guarantees scale at most sublinearly in the number of layers (or even improve with increasing depth; see our comment in Section 3), our paper complements these lines of research and shows that sparsity-inducing regularization is an effective approach to coping with the complexity of deep and very deep networks. Connection sparsity limits the number of nonzero entries in each parameter matrix, while node sparsity only limits the number of nonzero columns. Hence, the number of nonzero entries within each column, and with it the width of the subsequent layer, is regulated only in the case of connection sparsity. Our theoretical results reflect this insight in that the bounds for the connection- and node-sparse estimators depend on the networks' width logarithmically and sublinearly, respectively. Practically speaking, our results indicate that connection sparsity is suitable to handle wide networks, but node sparsity is suitable only when complemented by connection sparsity or other strategies.
The mild logarithmic dependence of our connection-sparse bounds on the number of output nodes illustrates that networks with many outputs can be learned in practice. Our prediction theory is the first one that considers multiple output nodes; a classification theory with a logarithmic dependence on the output nodes has been established very recently in Ledent et al. (2019). The mathematical underpinnings of our theory are very different from those of most other papers in theoretical deep learning. The proof of the main theorem shares similarities with proofs in high-dimensional statistics; to formulate and control the relevant empirical processes, we use the concept of effective noise, chaining, and Lipschitz properties of neural networks. These tools are not standard in deep learning theory and, therefore, might be of more general interest (see Appendix A.7 for further details). Our theory shares some limitations with all other current theories in deep learning: the network architectures are simpler than the ones typically used in practice (cf. Lederer (2020b), though); the bounds concern global optima rather than the local optima or saddle points provided by many practical algorithms; and the theory does not entail a practical scheme for the calibration of the tuning parameters. Nevertheless, our theory, and mathematical theory in general, provides insights about what accuracies to expect in practice and about what network types and estimators might be suitable for a given problem. In summary, our paper highlights the benefits of sparsity in deep learning and, more generally, showcases the usefulness of statistical analyses for understanding neural networks.

A APPENDIX

The Appendix consists of two auxiliary results and the proofs of Theorem 1 and Propositions 1 and 2. Our approach combines techniques from high-dimensional statistics and empirical-process theory that are very different from the techniques used in most other approaches in the literature.

A.1 LIPSCHITZ PROPERTY

In this section, we prove a Lipschitz property that we use in the proof of Proposition 2.

Proposition 4 (Lipschitz Property). In the framework of Sections 2 and 3, it holds for all $\overline{\Theta}, \overline{\Gamma} \in \overline{\mathcal{M}}_1$ that

$\|\overline{g}_{\overline{\Theta}}[x] - \overline{g}_{\overline{\Gamma}}[x]\|_\infty \le \sqrt{l}\, \|x\|_\infty\, |||\overline{\Theta} - \overline{\Gamma}|||_F$

and for all $\overline{\Theta}, \overline{\Gamma} \in \overline{\mathcal{M}}_{2,1}$ that

$\|\overline{g}_{\overline{\Theta}}[x] - \overline{g}_{\overline{\Gamma}}[x]\|_2 \le \sqrt{l}\, \|x\|_2\, |||\overline{\Theta} - \overline{\Gamma}|||_F$.

The Frobenius norm is defined as $|||\overline{\Theta}|||_F := \sqrt{\sum_{j=0}^{l-1} |||\Theta_j|||_F^2} = \sqrt{\sum_{j=0}^{l-1} \sum_{i=1}^{p_{j+1}} \sum_{k=1}^{p_j} |(\Theta_j)_{ik}|^2}$ for $\overline{\Theta} \in \overline{\mathcal{M}}_1 \cup \overline{\mathcal{M}}_{2,1}$. Proposition 4 generalizes (Taheri et al., 2020, Proposition 2) to vector-valued network outputs and to node sparsity, and it replaces their $\|x\|_2$ with the smaller $\|x\|_\infty$ in the connection-sparse case.

Proof of Proposition 4. This proof generalizes and sharpens the proof of Taheri et al. (2020), and it simplifies some arguments of that proof. We define the "inner subnetworks" of a network $\overline{g}_{\overline{\Theta}}$ with $\overline{\Theta} \in \overline{\mathcal{M}}_{2,1}$ as the vector-valued functions

$\mathcal{S}^0 \overline{g}_{\overline{\Theta}} : \mathbb{R}^d \to \mathbb{R}^{p_1},\ x \mapsto \mathcal{S}^0 \overline{g}_{\overline{\Theta}}[x] := \Theta_0 x$

and

$\mathcal{S}^j \overline{g}_{\overline{\Theta}} : \mathbb{R}^d \to \mathbb{R}^{p_{j+1}},\ x \mapsto \mathcal{S}^j \overline{g}_{\overline{\Theta}}[x] := \Theta_j f_j[\cdots f_1[\Theta_0 x]]$ for $j \in \{1, \dots, l-1\}$.

Similarly, we define the "outer subnetworks" of $\overline{g}_{\overline{\Theta}}$ as the vector-valued functions

$\overline{\mathcal{S}}^j \overline{g}_{\overline{\Theta}} : \mathbb{R}^{p_j} \to \mathbb{R}^{p_l},\ z \mapsto \overline{\mathcal{S}}^j \overline{g}_{\overline{\Theta}}[z] := f_l[\Theta_{l-1} \cdots f_j[z]]$ for $j \in \{1, \dots, l-1\}$

and

$\overline{\mathcal{S}}^l \overline{g}_{\overline{\Theta}} : \mathbb{R}^{p_l} \to \mathbb{R}^{p_l},\ z \mapsto \overline{\mathcal{S}}^l \overline{g}_{\overline{\Theta}}[z] := f_l[z]$.

The initial network can be split into an inner and an outer subnetwork along every layer $j \in \{1, \dots, l\}$:

$\overline{g}_{\overline{\Theta}}[x] = \overline{\mathcal{S}}^j \overline{g}_{\overline{\Theta}}\big[\mathcal{S}^{j-1} \overline{g}_{\overline{\Theta}}[x]\big]$ for $x \in \mathbb{R}^d$.

We call this our splitting argument. To exploit the splitting argument, we derive a contraction result for the inner subnetworks and a Lipschitz result for the outer subnetworks.
We denote the $\ell_2$-operator norm of a matrix $A$, that is, the largest singular value of $A$, by $|||A|||_{\mathrm{op}}$. Using then the assumptions that the activation functions are 1-Lipschitz and satisfy $f_j[0_{p_j}] = 0_{p_j}$, we get for every $\overline{\Theta} = (\Theta_{l-1}, \dots, \Theta_0) \in \overline{\mathcal{M}}_{2,1}$ and $x \in \mathbb{R}^d$ that

$\|\mathcal{S}^{j-2} \overline{g}_{\overline{\Theta}}[x]\|_2 = \|\Theta_{j-2} f_{j-2}[\mathcal{S}^{j-3} \overline{g}_{\overline{\Theta}}[x]]\|_2 \le |||\Theta_{j-2}|||_{\mathrm{op}}\, \|f_{j-2}[\mathcal{S}^{j-3} \overline{g}_{\overline{\Theta}}[x]]\|_2 \le |||\Theta_{j-2}|||_{\mathrm{op}}\, \|\mathcal{S}^{j-3} \overline{g}_{\overline{\Theta}}[x]\|_2 \le \dots \le \prod_{k=1}^{j-2} |||\Theta_k|||_{\mathrm{op}}\, \|\Theta_0 x\|_2 \le \prod_{k=0}^{j-2} |||\Theta_k|||_{\mathrm{op}}\, \|x\|_2$

for all $j \in \{2, \dots, l\}$. Now, since $|||\Theta_k|||_{\mathrm{op}} \le |||\Theta_k|||_F \le |||\Theta_k|||_{2,1}$ and $\overline{\Theta} \in \overline{\mathcal{M}}_{2,1}$, we can deduce from the display that

$\|\mathcal{S}^{j-2} \overline{g}_{\overline{\Theta}}[x]\|_2 \le \prod_{k=0}^{j-2} |||\Theta_k|||_{2,1}\, \|x\|_2$.

This inequality is our contraction property. By similar arguments, we get for every $z_1, z_2 \in \mathbb{R}^{p_j}$ that

$\|\overline{\mathcal{S}}^j \overline{g}_{\overline{\Theta}}[z_1] - \overline{\mathcal{S}}^j \overline{g}_{\overline{\Theta}}[z_2]\|_2 = \|f_l[\Theta_{l-1} \cdots f_j[z_1]] - f_l[\Theta_{l-1} \cdots f_j[z_2]]\|_2 \le \|\Theta_{l-1} f_{l-1}[\cdots f_j[z_1]] - \Theta_{l-1} f_{l-1}[\cdots f_j[z_2]]\|_2 \le |||\Theta_{l-1}|||_{\mathrm{op}}\, \|f_{l-1}[\cdots f_j[z_1]] - f_{l-1}[\cdots f_j[z_2]]\|_2 \le \dots \le \prod_{k=j}^{l-1} |||\Theta_k|||_{\mathrm{op}}\, \|z_1 - z_2\|_2$

for $j \in \{1, \dots, l\}$, where $\prod_{k=l}^{l-1} |||\Theta_k|||_{\mathrm{op}} := 1$. Hence, similarly as above,

$\|\overline{\mathcal{S}}^j \overline{g}_{\overline{\Theta}}[z_1] - \overline{\mathcal{S}}^j \overline{g}_{\overline{\Theta}}[z_2]\|_2 \le \prod_{k=j}^{l-1} |||\Theta_k|||_{2,1}\, \|z_1 - z_2\|_2$.

This inequality is our Lipschitz property. We now use the contraction and Lipschitz properties of the subnetworks to derive a Lipschitz result for the entire network. We consider two networks $\overline{g}_{\overline{\Theta}}$ and $\overline{g}_{\overline{\Gamma}}$ with parameters $\overline{\Theta} = (\Theta_{l-1}, \dots, \Theta_0) \in \overline{\mathcal{M}}_{2,1}$ and $\overline{\Gamma} = (\Gamma_{l-1}, \dots, \Gamma_0) \in \overline{\mathcal{M}}_{2,1}$, respectively. Our above-derived splitting argument applied with $j = 1$ and $j = l$, respectively, yields

$\|\overline{g}_{\overline{\Theta}}[x] - \overline{g}_{\overline{\Gamma}}[x]\|_2 = \|\overline{\mathcal{S}}^1 \overline{g}_{\overline{\Theta}}[\mathcal{S}^0 \overline{g}_{\overline{\Theta}}[x]] - \overline{\mathcal{S}}^l \overline{g}_{\overline{\Gamma}}[\mathcal{S}^{l-1} \overline{g}_{\overline{\Gamma}}[x]]\|_2$.

Elementary algebra and the fact that $\overline{\mathcal{S}}^{j-1} \overline{g}_{\overline{\Theta}}[\mathcal{S}^{j-2} \overline{g}_{\overline{\Gamma}}[x]] = \overline{\mathcal{S}}^j \overline{g}_{\overline{\Theta}}[\Theta_{j-1} f_{j-1}[\mathcal{S}^{j-2} \overline{g}_{\overline{\Gamma}}[x]]]$ for $j \in \{2, \dots, l\}$ then allow us to derive
, l} then allow us to derive g Θ [x] -g Γ [x] 2 = S 1 g Θ S 0 g Θ [x] - l j=1 S j g Θ S j-1 g Γ [x] -S j g Θ S j-1 g Γ [x] -S l g Γ S l-1 g Γ [x] 2 = S 1 g Θ S 0 g Θ [x] -S 1 g Θ S 0 g Γ [x] - l j=2 S j g Θ S j-1 g Γ [x] -S j-1 g Θ S j-2 g Γ [x] + S l g Θ S l-1 g Γ [x] -S l g Γ S l-1 g Γ [x] 2 = S 1 g Θ S 0 g Θ [x] -S 1 g Θ S 0 g Γ [x] - l j=2 S j g Θ S j-1 g Γ [x] -S j g Θ Θ j-1 f j-1 S j-2 g Γ [x] + S l g Θ S l-1 g Γ [x] -S l g Γ S l-1 g Γ [x] 2 ≤ S 1 g Θ S 0 g Θ [x] -S 1 g Θ S 0 g Γ [x] 2 + l j=2 S j g Θ S j-1 g Γ [x] -S j g Θ Θ j-1 f j-1 S j-2 g Γ [x] 2 + S l g Θ S l-1 g Γ [x] -S l g Γ S l-1 g Γ [x] 2 . We bound this further by using the above-derived Lipschitz property of the outer networks and the observation that S l g Θ [S l-1 g Γ [x]] = S l g Γ [S l-1 g Γ [x]]: g Θ [x] -g Γ [x] 2 ≤ l-1 k=1 |||Θ k ||| 2,1 S 0 g Θ [x] -S 0 g Γ [x] 2 + l j=2 l-1 k=j |||Θ k ||| 2,1 S j-1 g Γ [x] -Θ j-1 f j-1 S j-2 g Γ [x] 2 , which is by the definition of the inner networks equivalent to g Θ [x] -g Γ [x] 2 ≤ l-1 k=1 |||Θ k ||| 2,1 ||Θ 0 x -Γ 0 x|| 2 + l j=2 l-1 k=j |||Θ k ||| 2,1 Γ j-1 f j-1 S j-2 g Γ [x] -Θ j-1 f j-1 S j-2 g Γ [x] 2 . Using the properties of the operator norm, we can deduce from this inequality that g Θ [x] -g Γ [x] 2 ≤ l-1 k=1 |||Θ k ||| 2,1 |||Θ 0 -Γ 0 ||| op ||x|| 2 + l j=2 l-1 k=j |||Θ k ||| 2,1 |||Γ j-1 -Θ j-1 ||| op f j-1 S j-2 g Γ [x] 2 . Invoking the mentioned conditions on the activation functions and the contraction property for the inner subnetworks then yields g Θ [x] -g Γ [x] 2 ≤ max v∈{0,...,l-1} k∈{0,...,l-1} k =v max |||Θ k ||| 2,1 , |||Γ k ||| 2,1 l-1 j=0 |||Γ j -Θ j ||| op ||x|| 2 ≤ √ l||x|| 2 |||Θ -Γ||| F . The proof for the connection-sparse case is almost the same. 
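As a numerical sanity check on this final display, the sketch below draws random parameters whose layers are rescaled to $|||\cdot|||_{2,1}$-norm at most one (a stand-in for membership in $\mathcal{M}_{2,1}$; the dimensions and the number of trials are arbitrary) and verifies the bound $\|g_\Theta[x] - g_\Gamma[x]\|_2 \le \sqrt{l}\,\|x\|_2\,|||\Theta - \Gamma|||_F$:

```python
import numpy as np

rng = np.random.default_rng(1)
relu = lambda z: np.maximum(z, 0.0)

def norm21(A):
    """|||A|||_{2,1}: sum of the l2-norms of the columns of A."""
    return np.linalg.norm(A, axis=0).sum()

def normalized_params(dims):
    """Random layers rescaled so that every |||Theta_j|||_{2,1} <= 1."""
    params = []
    for j in range(len(dims) - 1):
        W = rng.normal(size=(dims[j + 1], dims[j]))
        params.append(W / norm21(W))
    return params

def g(Theta, x):
    z = x
    for W in Theta:
        z = relu(W @ z)
    return z

dims = [4, 5, 6, 3]  # hypothetical widths
l = len(dims) - 1
for _ in range(100):
    Theta, Gamma = normalized_params(dims), normalized_params(dims)
    x = rng.normal(size=dims[0])
    lhs = np.linalg.norm(g(Theta, x) - g(Gamma, x))
    # |||Theta - Gamma|||_F: Frobenius norm over all layers jointly
    frob = np.sqrt(sum(np.linalg.norm(T - G) ** 2 for T, G in zip(Theta, Gamma)))
    rhs = np.sqrt(l) * np.linalg.norm(x) * frob
    assert lhs <= rhs + 1e-12
```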

A.2 ENTROPY BOUND

In this section, we establish bounds for the entropies of $\mathcal{M}_1$ and $\mathcal{M}_{2,1}$. The distance between two networks $g_\Theta$ and $g_\Gamma$ is defined as
$$\text{dist}[g_\Theta, g_\Gamma] := \sqrt{\sum_{i=1}^n \big\|g_\Theta[x_i] - g_\Gamma[x_i]\big\|_\infty^2 / n}\,.$$
Given this distance function and a radius $t \in (0, \infty)$, the metric entropy of a nonempty set $\mathcal{A} \subset \{\Theta = (\Theta_{l-1}, \ldots, \Theta_0) : \Theta_j \in \mathbb{R}^{p_{j+1} \times p_j}\}$ is denoted by $H[t, \mathcal{A}]$. We then get the following entropy bounds.

Lemma 1 (Entropy Bounds). In the framework of Sections 2 and 3, it holds for a constant $c_H \in (0, \infty)$ and every $t \in (0, \infty)$ that
$$H[t, \mathcal{M}_1] \le \frac{c_H (v_\infty)^2 l}{t^2} \bigg(\log\Big[\frac{p t^2}{(v_\infty)^2 l}\Big] + 2\bigg)$$
and
$$H[t, \mathcal{M}_{2,1}] \le \frac{c_H (v_\infty)^2 l p}{t^2} \bigg(\log\Big[\frac{p t^2}{(v_\infty)^2 l}\Big] + 2\bigg)\,.$$

Proof of Lemma 1. The first bound can be derived by combining established deterministic and randomization arguments (Carl, 1985); (Lederer, 2010, Proof of Theorem 1.1); (Taheri et al., 2020, Proposition 3). For the second bound, observe that
$$|||\Theta_j|||_1 = \sum_{i=1}^{p_{j+1}} \sum_{k=1}^{p_j} |(\Theta_j)_{ik}| \le \sqrt{p_{j+1}}\, \sum_{k=1}^{p_j} \sqrt{\sum_{i=1}^{p_{j+1}} |(\Theta_j)_{ik}|^2} = \sqrt{p_{j+1}}\, |||\Theta_j|||_{2,1} \le \sqrt{p}\, |||\Theta_j|||_{2,1}$$
for all $j \in \{0, \ldots, l-1\}$ and $\Theta_j \in \mathbb{R}^{p_{j+1} \times p_j}$. We used in turn 1. the definition of the $|||\cdot|||_1$-norm on Page 2, 2. the linearity and interchangeability of finite sums and the inequality $\|a\|_1 \le \sqrt{b}\,\|a\|_2$ for all $a \in \mathbb{R}^b$, 3. the definition of the $|||\cdot|||_{2,1}$-norm on Page 4, and 4. the definition of the width $p$ on Page 2. Hence, $\mathcal{M}_{2,1} \subset \sqrt{p}\,\mathcal{M}_1$. A bound for the entropies of $\mathcal{M}_{2,1}$ can, therefore, be derived from the first bound by replacing the radius $t$ on the right-hand side by $t/\sqrt{p}$.
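The norm chain in the proof of Lemma 1 is easy to confirm numerically. The sketch below (with arbitrary matrix dimensions) checks $|||\Theta_j|||_1 \le \sqrt{p_{j+1}}\,|||\Theta_j|||_{2,1}$ for random matrices, where rows correspond to $p_{j+1}$ and columns to $p_j$:

```python
import numpy as np

rng = np.random.default_rng(2)
for _ in range(100):
    rows, cols = rng.integers(1, 10, size=2)  # rows = p_{j+1}, cols = p_j
    A = rng.normal(size=(rows, cols))
    elementwise_l1 = np.abs(A).sum()                # |||A|||_1 (sum of all entries)
    l21 = np.linalg.norm(A, axis=0).sum()           # |||A|||_{2,1} (sum of column l2-norms)
    # column-wise Hölder: ||a||_1 <= sqrt(b) ||a||_2 for a in R^b, applied per column
    assert elementwise_l1 <= np.sqrt(rows) * l21 + 1e-12
```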

A.3 PROOF OF THEOREM 1

In this section, we state a proof for Theorem 1. The proof is inspired by derivations in high-dimensional statistics; see, for example, (Zhuang & Lederer, 2018) and references therein.

Proof of Theorem 1. The main idea of the proof is to contrast the estimators' objective functions evaluated at their minima with the estimators' objective functions at other points. Our first step is to derive what we call a basic inequality. By the definition of the estimator in (6), it holds for every $\Theta \in \mathcal{M}_{2,1}$ that
$$\sum_{i=1}^n \big\|y_i - g_{\widehat\Theta}[x_i]\big\|_2^2 + r_{\text{node}} |||\widehat\Theta_l|||_{2,1} \le \sum_{i=1}^n \big\|y_i - g_\Theta[x_i]\big\|_2^2 + r_{\text{node}} |||\Theta_l|||_{2,1}\,,$$
where we use the shorthand $\widehat\Theta := \widehat\Theta_{\text{node}}$. We then invoke the model in (1) to rewrite this inequality as
$$\sum_{i=1}^n \big\|g^*[x_i] + u_i - g_{\widehat\Theta}[x_i]\big\|_2^2 + r_{\text{node}} |||\widehat\Theta_l|||_{2,1} \le \sum_{i=1}^n \big\|g^*[x_i] + u_i - g_\Theta[x_i]\big\|_2^2 + r_{\text{node}} |||\Theta_l|||_{2,1}\,.$$
Expanding the squared terms and rearranging the inequality then yields
$$\sum_{i=1}^n \big\|g^*[x_i] - g_{\widehat\Theta}[x_i]\big\|_2^2 \le \sum_{i=1}^n \big\|g^*[x_i] - g_\Theta[x_i]\big\|_2^2 + 2\sum_{i=1}^n \big(g_{\widehat\Theta}[x_i]\big)^\top u_i - 2\sum_{i=1}^n \big(g_\Theta[x_i]\big)^\top u_i + r_{\text{node}} |||\Theta_l|||_{2,1} - r_{\text{node}} |||\widehat\Theta_l|||_{2,1}\,.$$
This is our basic inequality. In the remainder of the proof, we need to bound the first two terms in the last line of the basic inequality. We call these terms the empirical-process terms. Using the reformulation of the networks in (7), we can write the empirical-process term of a general parameter $\Gamma \in \mathcal{M}_{2,1}$ according to
$$2\sum_{i=1}^n \big(g_\Gamma[x_i]\big)^\top u_i = 2\sum_{i=1}^n \big(\Gamma_l\, \bar g_\Gamma[x_i]\big)^\top u_i$$
with $\Gamma \in \mathcal{M}_{2,1}$. Using 1. the properties of transpositions, 2. the definition of the trace function, 3. the cyclic property of the trace function, and 4. the linearity of the trace function yields further
$$2\sum_{i=1}^n \big(g_\Gamma[x_i]\big)^\top u_i = 2\sum_{i=1}^n \big(\bar g_\Gamma[x_i]\big)^\top (\Gamma_l)^\top u_i = 2\sum_{i=1}^n \text{trace}\Big[\big(\bar g_\Gamma[x_i]\big)^\top (\Gamma_l)^\top u_i\Big] = 2\sum_{i=1}^n \text{trace}\Big[u_i \big(\bar g_\Gamma[x_i]\big)^\top (\Gamma_l)^\top\Big] = 2\,\text{trace}\bigg[\sum_{i=1}^n u_i \big(\bar g_\Gamma[x_i]\big)^\top (\Gamma_l)^\top\bigg]\,.$$
Now, 1. denoting the column vector that corresponds to the $k$th column of a matrix $A$ by $A_{\bullet k}$, 2. using Hölder's inequality, 3. using Hölder's inequality again, and 4. again Hölder's inequality and our definitions of the elementwise $\ell_\infty$- and $\ell_1$-norms, we find
$$2\sum_{i=1}^n \big(g_\Gamma[x_i]\big)^\top u_i = 2\sum_{k=1}^{p_l} \bigg\langle \Big[\sum_{i=1}^n u_i \big(\bar g_\Gamma[x_i]\big)^\top\Big]_{\bullet k},\, (\Gamma_l)_{\bullet k} \bigg\rangle \le 2\sum_{k=1}^{p_l} \bigg\|\Big[\sum_{i=1}^n u_i \big(\bar g_\Gamma[x_i]\big)^\top\Big]_{\bullet k}\bigg\|_2 \big\|(\Gamma_l)_{\bullet k}\big\|_2 \le 2\max_{k \in \{1, \ldots, p_l\}} \bigg\|\Big[\sum_{i=1}^n u_i \big(\bar g_\Gamma[x_i]\big)^\top\Big]_{\bullet k}\bigg\|_2\, |||\Gamma_l|||_{2,1} \le 2\sqrt{m}\, \bigg\|\sum_{i=1}^n u_i \big(\bar g_\Gamma[x_i]\big)^\top\bigg\|_\infty |||\Gamma_l|||_{2,1}\,,$$
which implies, in view of the definition of the effective noise in (8),
$$2\sum_{i=1}^n \big(g_\Gamma[x_i]\big)^\top u_i \le r^*_{\text{node}} |||\Gamma_l|||_{2,1}\,.$$
This inequality is our bound on the empirical-process terms. We can combine the bound on the empirical-process terms and the basic inequality to find
$$\sum_{i=1}^n \big\|g^*[x_i] - g_{\widehat\Theta}[x_i]\big\|_2^2 \le \sum_{i=1}^n \big\|g^*[x_i] - g_\Theta[x_i]\big\|_2^2 + r^*_{\text{node}} |||\widehat\Theta_l|||_{2,1} + r^*_{\text{node}} |||\Theta_l|||_{2,1} + r_{\text{node}} |||\Theta_l|||_{2,1} - r_{\text{node}} |||\widehat\Theta_l|||_{2,1}\,.$$
Using then the assumption $r_{\text{node}} \ge r^*_{\text{node}}$ yields
$$\sum_{i=1}^n \big\|g^*[x_i] - g_{\widehat\Theta}[x_i]\big\|_2^2 \le \sum_{i=1}^n \big\|g^*[x_i] - g_\Theta[x_i]\big\|_2^2 + 2 r_{\text{node}} |||\Theta_l|||_{2,1}\,.$$
Multiplying both sides by $1/n$ and taking the infimum over $\Theta \in \mathcal{M}_{2,1}$ on the right-hand side then gives
$$\frac{1}{n}\sum_{i=1}^n \big\|g^*[x_i] - g_{\widehat\Theta}[x_i]\big\|_2^2 \le \inf_{\Theta \in \mathcal{M}_{2,1}} \bigg\{\frac{1}{n}\sum_{i=1}^n \big\|g^*[x_i] - g_\Theta[x_i]\big\|_2^2 + \frac{2 r_{\text{node}}}{n} |||\Theta_l|||_{2,1}\bigg\}\,.$$
Invoking the definition of the prediction error on Page 4 gives the desired result. The proof for the connection-sparse estimator is virtually the same.
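The Hölder chain that bounds the empirical-process term can also be tested on random data. In the sketch below (all dimensions and values are arbitrary illustrations), the rows of $U$ play the role of the noise vectors $u_i$ and the rows of $B$ the role of $\bar g_\Gamma[x_i]$:

```python
import numpy as np

rng = np.random.default_rng(3)
n, m, p_l = 20, 4, 6  # hypothetical sample size, output width, last hidden width
for _ in range(100):
    U = rng.normal(size=(n, m))        # noise vectors u_i as rows
    B = rng.normal(size=(n, p_l))      # bar g_Gamma[x_i] as rows
    Gamma_l = rng.normal(size=(m, p_l))

    lhs = 2 * np.sum(U * (B @ Gamma_l.T))           # 2 sum_i <Gamma_l bar g_Gamma[x_i], u_i>
    M = U.T @ B                                     # sum_i u_i (bar g_Gamma[x_i])^T, an m x p_l matrix
    norm21 = np.linalg.norm(Gamma_l, axis=0).sum()  # |||Gamma_l|||_{2,1}
    rhs = 2 * np.sqrt(m) * np.abs(M).max() * norm21
    assert lhs <= rhs + 1e-9
```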



Figure 1: Exemplary networks produced by the connection-sparse estimator (3) and the node-sparse estimator (6).

Rademacher complexities involve Rademacher random variables that are not connected with the statistical model at hand, while the effective noises involve the noise variables, which are completely specified by the model and, therefore, can have any distribution (see our sub-Gaussian example further below). Hence, there are no general techniques to relate Rademacher complexities and effective noises. Not only are the two concepts distinct, but they are also used in very different ways. For example, existing theories use Rademacher complexities to measure the size of the function class at hand, while we use effective noises to measure the maximal impact of the stochastic noise on the estimators. (Our proofs also require a measure of the size of the function class, but this measure is entropy; cf. Lemma 1.) In general, our proof techniques are very different from those in the context of Rademacher complexities.

tuning parameter, $\mathcal{M}_{\text{gen}}$ any nonempty subset of $\mathcal{M}$, and $|||\cdot|||$ any norm. The bound for such an estimator is then
$$\text{err}[\widehat\Theta_{\text{gen}}] \le \inf_{\Theta \in \mathcal{M}_{\text{gen}}} \bigg\{\frac{1}{n}\sum_{i=1}^n \big\|g^*[x_i] - g_\Theta[x_i]\big\|_2^2 + \frac{2 r_{\text{gen}}}{n} |||\Theta_l|||\bigg\}$$
for $r_{\text{gen}} \ge r^*_{\text{gen}}$, where $r^*_{\text{gen}}$ is as $r^*_{\text{con}}$ but based on the dual norm of $|||\cdot|||$ instead of the dual norm of $|||\cdot|||_1$.

Table: presence (✓) or absence (✗) of certain features in previous statistical theories for sparse deep learning.

The main difference is that one needs to use the $\|\cdot\|_\infty$- and $|||\cdot|||_1$-norms (rather than the $\|\cdot\|_2$- and $|||\cdot|||_{\text{op}}$-norms) and the inequality $\|Ab\|_\infty \le |||A|||_1 \|b\|_\infty$ (rather than the inequality $\|Ab\|_2 \le |||A|||_{\text{op}} \|b\|_2$) to establish suitable contraction and Lipschitz properties.
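The substitute inequality for the connection-sparse case is straightforward to confirm numerically (random matrices and vectors of arbitrary size):

```python
import numpy as np

rng = np.random.default_rng(4)
for _ in range(100):
    rows, cols = rng.integers(1, 10, size=2)
    A = rng.normal(size=(rows, cols))
    b = rng.normal(size=cols)
    # ||Ab||_inf <= |||A|||_1 ||b||_inf, with |||A|||_1 the elementwise l1-norm:
    # the max row l1-sum of A is dominated by the sum of all |A_ik|
    assert np.abs(A @ b).max() <= np.abs(A).sum() * np.abs(b).max() + 1e-12
```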

A.4 PROOF OF PROPOSITION 1

In this section, we give a short proof of Proposition 1.

Proof of Proposition 1. Verify the fact that if the all-zeros parameter is neither a solution of (3) nor of (5), all solutions $\widehat\Theta_{\text{con}}$ and $\widetilde\Theta_{\text{con}}$ of (3) and (5), respectively, satisfy $(\widehat\Theta_{\text{con}})_j, (\widetilde\Theta_{\text{con}})_j \ne 0_{p_{j+1} \times p_j}$ for all $j \in \{0, \ldots, l\}$. It then follows from the assumed nonnegative homogeneity, $r_{\text{con}} > 0$, and the definition of the estimator in (3) that $|||(\widehat\Theta_{\text{con}})_0|||_1 = \cdots = |||(\widehat\Theta_{\text{con}})_{l-1}|||_1 = 1$ for all solutions $\widehat\Theta_{\text{con}}$. Given a solution $\widetilde\Theta_{\text{con}}$ of (5), define a rescaled parameter whose inner layers have unit $|||\cdot|||_1$-norm; by the nonnegative homogeneity, this rescaled parameter has the same value in the objective function as $\widetilde\Theta_{\text{con}}$.

A.5 PROOF OF PROPOSITION 2

In this section, we establish a proof of Proposition 2. The key tools are the Lipschitz property of Proposition 4 and the entropy bounds of Lemma 1.

Proof of Proposition 2. The main idea is to rewrite the event under consideration in a form that is amenable to known tail bounds for suprema of empirical processes with sub-Gaussian random variables. The connection-sparse bound follows from a chain of inequalities in which we use in turn 1. the definition of $r^*_{\text{con}}$ in (8), 2. the union bound, 3. van de Geer (2000, Corollary 8.3) together with our Proposition 4 and Lemma 1, and 4. the inequality $p_l \le p$ and consolidating the factors. The key concept underlying van de Geer (2000, Corollary 8.3 on Page 128) is chaining (van der Vaart & Wellner, 1996, Page 90). The same considerations also apply to the node-sparse case, but we get an additional factor $\sqrt{m}$ from the definition of the effective noise in (8) and a factor $\sqrt{p}$ from the entropy bound in Lemma 1. The differences between the bounds for the connection- and node-sparse cases in terms of $v_\infty$ vs. $v_2$ stem from the different Lipschitz constants in Proposition 4.

A.6 PROOF OF PROPOSITION 3

Proof of Proposition 3. The proof is based on standard empirical-process theory, including contraction and symmetrization arguments. Using basic algebra and measure theory, it is easy to derive a bound that comprises three terms, with a constant $c_b \in (0, \infty)$ that depends only on $b$. The first term in this bound is the minimal risk as stated in the proposition, and the second term can be bounded by Corollary 1 and Proposition 2. Hence, it remains to bound the third term. In view of the law of large numbers, it is reasonable to hope for the third term to be small. But to make this precise, we have to keep in mind that the estimator itself depends on the input vectors. We, therefore, need to prepare the third term for the application of a uniform version of the law of large numbers. Using standard contraction arguments (see Boucheron et al., 2013, Chapter 11.3, for example) and Hölder's inequality, we can bound the third term by a quantity that removes the dependence on the estimator $\widehat\Theta_{\text{con}}$ up to a leading factor. To see that we can also neglect that factor, verify (see Proposition 2 and the proof of Theorem 1) that $|||(\widehat\Theta_{\text{con}})_l|||_1 \le 2|||(\Theta^*)_l|||_1$ with high probability as long as $r^*_{\text{con}} \ge c v_\infty \sqrt{nl(\log[2mnp])^3}$ with $c$ large enough. Consequently, we just need to consider the corresponding quantity without that factor. The last step is to bring this term into a form that is amenable to our earlier proofs. Using standard symmetrization arguments (see van der Vaart & Wellner, 1996, Chapter 2.3, for example), we can bound this quantity by a symmetrized version that involves i.i.d. Rademacher random variables $k_1, \ldots, k_n$. But even though $k_1, \ldots, k_n$ are i.i.d. Rademacher random variables, we do not resort to Rademacher complexities; instead, we use that Rademacher random variables are sub-Gaussian, so that we can then proceed similarly as in the proof of Proposition 2. The node-sparse case can be treated along the same lines.
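The last step exploits that Rademacher random variables are sub-Gaussian; for instance, Hoeffding's inequality gives $P(|\sum_i k_i a_i| \ge t) \le 2\exp[-t^2/(2\|a\|_2^2)]$ for fixed coefficients $a$. The small Monte Carlo sketch below (all constants arbitrary) illustrates this tail bound for i.i.d. Rademacher signs:

```python
import numpy as np

rng = np.random.default_rng(5)
n, reps = 200, 20000
a = rng.normal(size=n)                       # fixed, arbitrary coefficients
sigma = np.linalg.norm(a)                    # ||a||_2

K = rng.choice([-1.0, 1.0], size=(reps, n))  # i.i.d. Rademacher signs k_i
S = K @ a                                    # reps draws of sum_i k_i a_i

t = 2.5 * sigma
empirical_tail = np.mean(np.abs(S) >= t)                   # Monte Carlo tail probability
hoeffding_bound = 2 * np.exp(-t ** 2 / (2 * sigma ** 2))   # sub-Gaussian (Hoeffding) bound
assert empirical_tail <= hoeffding_bound
```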

A.7 EXTENSIONS

Our proof approach disentangles the specifics of the objective function (proof of Theorem 1), of the network structure (proof of Proposition 4), and of the stochastic terms (proofs of Lemma 1 and Proposition 2). This feature allows one to generalize and extend the results of this paper in straightforward ways. For example, extensions to different noise distributions only need a corresponding version of Proposition 2, with everything else unchanged. One could envision, for example, using concentration inequalities for heavy-tailed distributions such as in Lederer & van de Geer (2014). Extensions to different loss functions, to give another example, can be established by adjusting Theorem 1 accordingly. This can be done, for example, by invoking ideas from specialized literature on high-dimensional logistic regression such as Li & Lederer (2019). We omit further details to avoid digression; the key message is that the flexibility of the proofs is yet another advantage of our approach.

