LAYER SPARSITY IN NEURAL NETWORKS

Abstract

Sparsity has become popular in machine learning because it can save computational resources, facilitate interpretation, and prevent overfitting. In this paper, we discuss sparsity in the framework of neural networks. In particular, we formulate a new notion of sparsity that concerns the networks' layers and, therefore, aligns particularly well with the current trend toward deep networks. We call this notion layer sparsity. We then introduce corresponding regularization and refitting schemes that can complement standard deep-learning pipelines to generate more compact and accurate networks.

1. INTRODUCTION

The number of layers and the number of nodes in each layer are arguably among the most fundamental parameters of neural networks. But specifying these parameters can be challenging: deep and wide networks, that is, networks with many layers and nodes, can describe data in astounding detail, but they are also prone to overfitting and require large amounts of memory, computation, energy, and so forth. The resource requirements can be particularly problematic for real-time applications or applications on fitness trackers and other wearables, whose popularity has surged in recent years. A promising approach to meet these challenges is to fit network sizes adaptively, that is, to allow for many layers and nodes in principle, but to ensure that the final network is "simple" in that it has a small number of connections, nodes, or layers (Changpinyo et al., 2017; Han et al., 2016; Kim et al., 2016; Liu et al., 2015; Wen et al., 2016). Popular ways to fit such simple and compact networks include successively augmenting small networks (Ash, 1989; Bello, 1992), pruning large networks (Simonyan & Zisserman, 2015), or explicit sparsity-inducing regularization of the weight matrices, which we focus on here. An example is the $\ell_1$-norm, which can reduce the number of connections. Another example is the $\ell_1$-norm grouped over the rows of the weight matrices, which can reduce the number of nodes. It has been shown that such regularizers can indeed produce networks that are accurate and yet have a small number of nodes and connections either in the first layer (Feng & Simon, 2017) or overall (Alvarez & Salzmann, 2016; Liu et al., 2015; Scardapane et al., 2017). Such sparsity-inducing regularizers also have a long-standing tradition and thorough theoretical underpinning in statistics (Hastie et al., 2015). But while sparsity on the level of connections and nodes has been studied in some detail, sparsity on the level of layers is much less understood.
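To illustrate the two regularizers just mentioned, consider a single weight matrix whose rows correspond to nodes: the elementwise $\ell_1$-norm sums the absolute values of all connections, while the grouped version sums the $\ell_2$-norms of the rows, so that entire rows (and hence nodes) can be driven to zero. A minimal NumPy sketch (the function names and the toy matrix are ours, not from any particular library):

```python
import numpy as np

def l1_penalty(W):
    """Elementwise l1-norm: encourages individual connections to vanish."""
    return np.sum(np.abs(W))

def group_l1_penalty(W):
    """l1-norm grouped over rows: encourages whole rows (nodes) to vanish."""
    return np.sum(np.linalg.norm(W, axis=1))

W = np.array([[0.0, 0.0, 0.0],   # a "dead" node: its entire row is zero
              [1.0, -2.0, 0.0]])

print(l1_penalty(W))        # 3.0
print(group_l1_penalty(W))  # 0 + sqrt(1 + 4) = sqrt(5) ≈ 2.236
```

The grouped penalty assigns no extra cost to the first node once its row is exactly zero, which is what makes it effective at removing nodes rather than just individual connections.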
This lack of understanding contrasts with the current trend toward deep network architectures, which is supported by state-of-the-art performances of deep networks (LeCun et al., 2015; Schmidhuber, 2015), recent approximation theory for ReLU activation networks (Liang & Srikant, 2016; Telgarsky, 2016; Yarotsky, 2017), and recent statistical theory (Golowich et al., 2017; Kohler & Langer, 2019; Taheri et al., 2020). Hence, a better understanding of sparsity on the level of layers seems to be in order. We therefore discuss sparsity in this paper with a special emphasis on the networks' layers. Our key observation is that for typical activation functions such as ReLU, a layer can be removed if all its parameter values are non-negative. We leverage this observation in the development of a new regularizer that specifically targets sparsity on the level of layers, and we show that this regularizer can lead to more compact and more accurate networks. In summary, our contributions are the following:

1. We introduce a new notion of sparsity that we call layer sparsity.
2. We introduce a corresponding regularizer that can reduce network sizes.
3. We introduce an additional refitting step that can further improve prediction accuracies.

In Section 2, we specify our framework, discuss different notions of sparsity, and introduce our refitting scheme. In Section 3, we establish a numerical proof of concept. In Section 4, we conclude with a discussion.
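The key observation can be checked numerically: ReLU outputs are non-negative, so a weight matrix with only non-negative entries maps them to a non-negative vector, the subsequent ReLU acts as the identity, and the two adjacent weight matrices collapse into a single one. A small NumPy sketch (the three-layer toy network and its random weights are our own illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
relu = lambda z: np.maximum(z, 0.0)

# Three-layer ReLU network: f1[W1 f2[W2 f3[W3 x]]].
W3 = rng.standard_normal((4, 6))          # arbitrary signs
W2 = np.abs(rng.standard_normal((3, 4)))  # all entries non-negative
W1 = rng.standard_normal((1, 3))          # arbitrary signs

def deep(x):
    return relu(W1 @ relu(W2 @ relu(W3 @ x)))

# Since W2 >= 0 and relu(W3 @ x) >= 0, the argument of the middle ReLU
# is non-negative, so that ReLU is the identity and W1, W2 merge.
def merged(x):
    return relu((W1 @ W2) @ relu(W3 @ x))

x = rng.standard_normal(6)
print(np.allclose(deep(x), merged(x)))  # True
```

The merged network computes exactly the same function with one layer fewer, which is the mechanism behind the layer-sparsity regularizer developed below.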

2. SPARSITY IN NEURAL NETWORKS

We first state our framework, then discuss different notions of sparsity, and finally introduce a refitting scheme.

2.1. MATHEMATICAL FRAMEWORK

To fix ideas, we first consider fully-connected neural networks that model data according to

$$y_i = f_1\bigl[W_1 f_2[\cdots f_l[W_l x_i]]\bigr] + u_i,$$

where $i \in \{1, \dots, n\}$ indexes the $n$ different samples, $y_i \in \mathbb{R}$ is the output, $x_i \in \mathbb{R}^d$ is the corresponding input with $d$ the input dimension, $l$ is the number of layers, $W_j \in \mathbb{R}^{p_j \times p_{j+1}}$ for $j \in \{1, \dots, l\}$ are the weight matrices with $p_1 = 1$ and $p_{l+1} = d$, $f_j : \mathbb{R}^{p_j} \to \mathbb{R}^{p_j}$ for $j \in \{1, \dots, l\}$ are the activation functions, and $u_i \in \mathbb{R}$ is the random noise. Extensions beyond fully-connected networks are straightforward; see Section 2.5. We summarize the parameters in $W := (W_1, \dots, W_l) \in \mathcal{V} := \{V = (V_1, \dots, V_l) : V_j \in \mathbb{R}^{p_j \times p_{j+1}}\}$, and we write for ease of notation $f_V[x_i] := f_1[V_1 f_2[\cdots f_l[V_l x_i]]]$ for $V \in \mathcal{V}$. Neural networks are usually fitted based on regularized estimators in Lagrange form

$$\widehat{W} \in \operatorname*{argmin}_{V \in \mathcal{V}} \Bigl\{ \mathrm{DataFit}[y_1, \dots, y_n, x_1, \dots, x_n] + h[V] \Bigr\} \quad (3)$$

or constraint form

$$\widehat{W} \in \operatorname*{argmin}_{V \in \mathcal{V},\; h[V] \le 1} \mathrm{DataFit}[y_1, \dots, y_n, x_1, \dots, x_n],$$

where $\mathrm{DataFit} : \mathbb{R}^n \times \mathbb{R}^{n \times d} \to [0, \infty)$ is a data-fitting function such as least squares $\sum_{i=1}^n (y_i - f_V[x_i])^2$, and $h : \mathcal{V} \to [0, \infty)$ is a regularizer such as the elementwise $\ell_1$-norm $\sum_{j,v,w} |(V_j)_{vw}|$. We are particularly interested in regularizers that induce sparsity.
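To make the notation concrete, the forward map $f_V$ and the Lagrange-form objective with a least-squares data fit and an elementwise $\ell_1$-regularizer can be sketched as follows (the tiny dimensions, the ReLU activations, and the tuning value are our own illustrative choices):

```python
import numpy as np

relu = lambda z: np.maximum(z, 0.0)

def f_V(V, x):
    """Forward map f_V[x] = f1[V1 f2[... fl[Vl x]]] with ReLU activations."""
    z = x
    for Vj in reversed(V):  # Vl is applied first, V1 last
        z = relu(Vj @ z)
    return z.item()         # p1 = 1, so the output is a scalar

def objective(V, X, y, lam=0.1):
    """Least-squares data fit plus lam times the elementwise l1-norm of V."""
    fit = sum((yi - f_V(V, xi)) ** 2 for xi, yi in zip(X, y))
    l1 = sum(np.sum(np.abs(Vj)) for Vj in V)
    return fit + lam * l1

rng = np.random.default_rng(1)
V = [rng.standard_normal((1, 3)), rng.standard_normal((3, 2))]  # l = 2, d = 2
X = rng.standard_normal((5, 2))
y = rng.standard_normal(5)
print(objective(V, X, y))  # a non-negative scalar
```

An actual fit would minimize this objective over $V$ with a stochastic-gradient method; the sketch only spells out what is being minimized.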

2.2. STANDARD NOTIONS OF SPARSITY

We first state two regularizers that are known in deep learning and the corresponding notions of sparsity. Consider the vanilla $\ell_1$-regularizer

$$\bar{h}_C[V] := \sum_{j=1}^{l} (r_C)_j |||V_j|||_1 := \sum_{j=1}^{l} (r_C)_j \sum_{v,w} |(V_j)_{vw}|,$$

where $r_C \in [0, \infty)^l$ is a vector of tuning parameters. This regularizer is the deep-learning equivalent of the lasso regularizer in linear regression (Tibshirani, 1996) and has received considerable attention

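In code, this regularizer differs from a plain elementwise $\ell_1$-penalty only through the per-layer tuning vector $r_C$, which allows different layers to be penalized at different strengths. A minimal sketch (the toy weights and tuning values are our own example):

```python
import numpy as np

def h_C(V, r_C):
    """Vanilla l1-regularizer: sum_j (r_C)_j * sum_{v,w} |(V_j)_{vw}|."""
    return sum(rj * np.sum(np.abs(Vj)) for rj, Vj in zip(r_C, V))

# Two layers with p1 = 1, p2 = 2, p3 = d = 1.
V = [np.array([[1.0, -1.0]]), np.array([[2.0], [-0.5]])]
print(h_C(V, r_C=[1.0, 0.5]))  # 1.0*2.0 + 0.5*2.5 = 3.25
```

Setting all entries of `r_C` to the same value recovers the unweighted elementwise $\ell_1$-norm.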
