SPARSIFYING NETWORKS VIA SUBDIFFERENTIAL INCLUSION

Abstract

Sparsifying deep neural networks is of paramount interest in many areas, especially when those networks have to be implemented on low-memory devices. In this article, we propose a new formulation of the problem of generating sparse weights for a neural network. By leveraging the properties of standard nonlinear activation functions, we show that the problem is equivalent to an approximate subdifferential inclusion problem, where the accuracy of the approximation controls the sparsity. We show that the proposed approach is valid for a broad class of activation functions (ReLU, sigmoid, softmax). We propose an iterative optimization algorithm with guaranteed convergence to induce sparsity. Owing to the algorithm's flexibility, sparsity can be enforced from partial training data in a minibatch manner. To demonstrate the effectiveness of our method, we perform experiments on various networks in different application contexts: image classification, speech recognition, natural language processing, and time-series forecasting.

1. INTRODUCTION

Deep neural networks have evolved into the state-of-the-art techniques in a wide array of applications: computer vision (Simonyan & Zisserman, 2015; He et al., 2016; Huang et al., 2017), automatic speech recognition (Hannun et al., 2014; Dong et al., 2018; Li et al., 2019; Watanabe et al., 2018; Hayashi et al., 2019; Inaguma et al., 2020), natural language processing (Turc et al., 2019; Radford et al., 2019; Dai et al., 2019b; Brown et al., 2020), and time-series forecasting (Oreshkin et al., 2020). While their performance in various applications has matched and often exceeded human capabilities, neural networks may remain difficult to apply in real-world scenarios. Deep neural networks leverage the power of Graphical Processing Units (GPUs), which are power-hungry; using GPUs to make billions of predictions per day thus comes with a substantial energy cost. In addition, despite their fast response times, deep neural networks are not yet suitable for most real-time applications, where memory-limited low-cost architectures need to be used. For all those reasons, compression and efficiency have become topics of high interest in the deep learning community. Sparsity in DNNs has been an active research topic generating numerous approaches. DNNs achieving the state of the art in a given problem usually have a large number of layers with a non-uniform parameter distribution across layers. Most sparsification methods are based on a global approach, which may result in sub-optimal compression at reduced accuracy. This may occur because layers with a smaller number of parameters may remain dense, although they may contribute more in terms of computational complexity (e.g., convolutional layers). Some methods, also known as magnitude pruning, use hard or soft thresholding to remove less significant parameters.
Soft-thresholding techniques achieve a good sparsity-accuracy trade-off at the cost of additional parameters and increased computation time during training. Searching for hardware-efficient networks is another area that has proven quite useful, but it requires a huge amount of computational resources. Convex optimization techniques such as those used in (Aghasi et al., 2017) often rely upon fixed-point iterations that make use of the proximity operator (Moreau, 1962). The related concepts are fundamental for tackling nonlinear problems and have recently come into play in the analysis of neural networks (Combettes & Pesquet, 2020a) and nonlinear systems (Combettes & Woodstock, 2020). This paper shows that the properties of nonlinear activation functions can be utilized to identify highly sparse subnetworks. We show that the sparsification of a network can be formulated as an approximate subdifferential inclusion problem. We provide an iterative algorithm called subdifferential inclusion for sparsity (SIS) that uses partial training data to identify a sparse subnetwork while maintaining good accuracy. SIS makes even small-parameter layers sparse, resulting in models with significantly lower inference FLOPs than the baselines. For example, SIS for a 90% sparse MobileNetV3 on ImageNet-1K achieves 66.07% top-1 accuracy with 33% fewer inference FLOPs than its dense counterpart, and thus provides better results than the state-of-the-art method RigL. For non-convolutional networks like Transformer-XL trained on WikiText-103, SIS is able to achieve 70% sparsity while maintaining a perplexity of 21.1. We evaluate our approach across four domains and show that our compressed networks can achieve competitive accuracy for potential use on commodity hardware and edge devices.

2.1. INDUCING SPARSITY POST TRAINING

Methods inducing sparsity after a dense network has been trained involve several pruning and fine-tuning cycles until the desired sparsity and accuracy are reached (Mozer & Smolensky, 1989; LeCun et al., 1990; Hassibi et al., 1993; Han et al., 2015; Molchanov et al., 2017; Guo et al., 2016; Park et al., 2020). (Renda et al., 2020) proposed a weight-rewinding technique instead of vanilla fine-tuning post-pruning. The Net-Trim algorithm (Aghasi et al., 2017) removes connections at each layer of a trained network by convex programming; it applies to networks using rectified linear units (ReLUs). Lowering the rank of parameter tensors (Jaderberg et al., 2014; vahid et al., 2020; Lu et al., 2016) and removing channels or filters while inducing group sparsity (Wen et al., 2016; Li et al., 2017; Luo et al., 2017; Gordon et al., 2018; Yu et al., 2019; Liebenwein et al., 2020) are methods that take the network structure into account. All these methods rely on pruning and fine-tuning cycle(s), often using full training data.

2.2. INDUCING SPARSITY DURING TRAINING

Another popular approach has been to induce sparsity during training. This can be achieved by modifying the loss function to consider sparsity as part of the optimization (Chauvin, 1989; Carreira-Perpiñán & Idelbayev, 2018; Ullrich et al., 2017; Neklyudov et al., 2017), or by pruning dynamically during training (Zhu & Gupta, 2018; Bellec et al., 2018; Mocanu et al., 2018; Dai et al., 2019a; Lin et al., 2020b) by observing network flow. (Mostafa & Wang, 2019; Dettmers & Zettlemoyer, 2020; Evci et al., 2020) compute weight magnitudes and reallocate weights at every step. Bayesian priors (Louizos et al., 2017), L0 and L1 regularization (Louizos et al., 2018), and variational dropout (Molchanov et al., 2017) reach accuracy comparable to (Zhu & Gupta, 2018), but at a cost of 2x memory and 4x computations during training. (Liu et al., 2019; Savarese et al., 2020; Kusupati et al., 2020; Lee, 2019; Xiao et al., 2019; Azarian et al., 2020) have proposed learnable sparsity methods that train the sparse masks and weights simultaneously with minimal heuristics. Although these methods are cheaper than pruning after training, they need at least the same computational effort as training a dense network to find a sparse subnetwork. This makes them expensive when compressing big networks whose number of parameters ranges from hundreds of millions to billions (Dai et al., 2019b; Li et al., 2019; Brown et al., 2020). (Frankle & Carbin, 2019) showed that it is possible to find sparse subnetworks that, when trained from scratch, are able to match or even outperform their dense counterparts. (Lee et al., 2019) presented SNIP, a method to estimate, at initialization, the importance that each weight could have later during training. In (Lee et al., 2020), the authors perform a theoretical study of pruning at initialization from a signal propagation perspective, focusing on the initialization scheme.
Recently, (Wang et al., 2020) proposed GraSP, a different method based on the gradient norm after pruning, and showed a significant improvement for moderate levels of sparsity. (Ye et al., 2020) start with a small subnetwork and progressively grow it into one that is as accurate as its dense counterpart. (Tanaka et al., 2020) propose SynFlow, which avoids layer collapse of a pruned network during training. (Jorge et al., 2020) proposed FORCE, an iterative pruning method that progressively removes a small number of weights and is able to achieve extreme sparsity at little accuracy expense. These methods are not usable for big pre-trained networks and are expensive, since multiple training rounds are required to produce different sparse models for different deployment scenarios (computing devices).

2.4. EFFICIENT NEURAL ARCHITECTURE SEARCH

Hardware-aware NAS methods (Zoph et al., 2018; Real et al., 2019; Cai et al., 2018; Wu et al., 2019; Tan et al., 2019; Cai et al., 2019; Howard et al., 2019) directly incorporate hardware feedback into an efficient neural architecture search. (Cai et al., 2020) propose to learn a single network composed of a large number of subnetworks, from which a hardware-aware subnetwork can be extracted in linear time. (Lin et al., 2020a) propose a similar approach wherein they identify subnetworks that can be run efficiently on microcontrollers (MCUs). Our proposed algorithm applies to possibly large pre-trained networks. In contrast with the methods presented in Section 2.1, ours can use a small amount of training data during pruning and fewer epochs during fine-tuning. As we will see in the next section, a key feature of our approach is that it is based on a fine analysis of the mathematical properties of activation functions, thus allowing the use of powerful convex optimization tools that offer sound convergence guarantees.

3.1. VARIATIONAL PRINCIPLES

A basic neural network layer can be described by the relation

    y = R(Wx + b),   (1)

where x ∈ R^M is the input, y ∈ R^N the output, W ∈ R^{N×M} the weight matrix, b ∈ R^N the bias vector, and R a nonlinear activation operator from R^N to R^N. A key observation is that most of the activation operators currently used in neural networks are proximity operators of convex functions (Combettes & Pesquet, 2020a;b). We will therefore assume that there exists a proper lower-semicontinuous convex function f from R^N to R ∪ {+∞} such that R = prox_f. We recall that f is a proper lower-semicontinuous convex function if its epigraph, the area over its graph {(y, ξ) ∈ R^N × R | f(y) ≤ ξ}, is a nonempty closed convex set. For such a function, the proximity operator of f at z ∈ R^N (Moreau, 1962) is the unique point

    prox_f(z) = argmin_{p ∈ R^N} (1/2)‖z − p‖² + f(p).   (2)

It follows from standard subdifferential calculus that Eq. (1) can be re-expressed as the following inclusion relation:

    Wx + b − y ∈ ∂f(y),   (3)

where ∂f(y) is the Moreau subdifferential of f at y, defined as ∂f(y) = {t ∈ R^N | (∀z ∈ R^N) f(z) ≥ f(y) + ⟨t | z − y⟩}. The subdifferential constitutes a useful extension of the notion of differential, applicable to nonsmooth functions. The set ∂f(y) is closed and convex and, if y satisfies Eq. (1), it is nonempty. The distance of a point z ∈ R^N to this set is given by d_{∂f(y)}(z) = inf_{t ∈ ∂f(y)} ‖z − t‖. We thus see that the subdifferential inclusion in Eq. (3) is also equivalent to d_{∂f(y)}(Wx + b − y) = 0. Therefore, a suitable accuracy measure for approximated values of the layer parameters (W, b) is d_{∂f(y)}(Wx + b − y).
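As a concrete check of the inclusion above, the following sketch takes the ReLU case, where R is the proximity operator of the indicator function of the nonnegative orthant, so that the projection onto ∂f(y) has a simple componentwise form (the function names are ours, for illustration):

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def proj_subdiff_relu(y, z):
    """Componentwise projection of z onto ∂f(y) for f = indicator of
    [0, +inf)^N: the set is {0} where y > 0 and (-inf, 0] where y = 0."""
    return np.where(y > 0, 0.0, np.minimum(z, 0.0))

def dist_to_subdiff(y, z):
    # d_{∂f(y)}(z) in the ReLU case
    return np.linalg.norm(z - proj_subdiff_relu(y, z))

rng = np.random.default_rng(0)
W = rng.standard_normal((4, 3)); b = rng.standard_normal(4)
x = rng.standard_normal(3)
y = relu(W @ x + b)                       # exact forward pass
print(dist_to_subdiff(y, W @ x + b - y))  # -> 0.0 (inclusion holds exactly)
```

With the exact layer parameters the distance is exactly zero, which is the equivalence d_{∂f(y)}(Wx + b − y) = 0 stated above.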

3.2. OPTIMIZATION PROBLEM

Compressing a network consists of sparsifying its parameters while keeping a satisfactory accuracy. Assume that, for a given layer, a training sequence of input/output pairs is available, which results from a forward pass performed on the original network for some input dataset of length K. The training sequence is split into J minibatches of size T, so that K = JT. The j-th minibatch, with j ∈ {1, …, J}, is denoted by (x_{j,t}, y_{j,t})_{1≤t≤T}. In order to compress the network, we propose to solve the following constrained optimization problem.

Problem 1. Minimize g(W, b) over (W, b) ∈ C, with

    C = {(W, b) ∈ R^{N×M} × R^N | (∀j ∈ {1, …, J}) ∑_{t=1}^T d²_{∂f(y_{j,t})}(W x_{j,t} + b − y_{j,t}) ≤ Tη},   (8)

where g is a sparsity measure defined on R^{N×M} × R^N and η ∈ [0, +∞[ is some accuracy tolerance.

Since, for every j ∈ {1, …, J}, the function (W, b) ↦ ∑_{t=1}^T d²_{∂f(y_{j,t})}(W x_{j,t} + b − y_{j,t}) is continuous and convex, C is a closed and convex subset of R^{N×M} × R^N. In addition, this set is nonempty when there exist W ∈ R^{N×M} and b ∈ R^N such that, for every j ∈ {1, …, J} and t ∈ {1, …, T}, d²_{∂f(y_{j,t})}(W x_{j,t} + b − y_{j,t}) = 0. As we have seen in Section 3.1, this condition is satisfied when (W, b) are the parameters of the uncompressed layer. Often, the sparsity of the weight matrix is the determining factor, whereas the bias vector represents a small number of parameters, so we can make the following assumption.

Assumption 2. For every W ∈ R^{N×M} and b ∈ R^N, g(W, b) = h(W), where h is a function from R^{N×M} to R ∪ {+∞} which is lower-semicontinuous, convex, and coercive (i.e., lim_{‖W‖_F → +∞} h(W) = +∞). In addition, there exists (W, b) ∈ C such that h(W) < +∞, and there exists (j*, t*) ∈ {1, …, J} × {1, …, T} such that y_{j*,t*} lies in the interior of the range of R.

Under this assumption, the existence of a solution to Problem 1 is guaranteed (see Appendix A).
A standard choice for such a function is the ℓ1-norm of the matrix elements, h = ‖·‖_1, but other convex sparsity measures could also be easily incorporated within this framework, e.g., group sparsity measures. Another point worth noticing is that constraints other than (8) could be imposed. For example, one could make the following alternative choice for the constraint set:

    C = {(W, b) ∈ R^{N×M} × R^N | sup_{j ∈ {1,…,J}, t ∈ {1,…,T}} d_{∂f(y_{j,t})}(W x_{j,t} + b − y_{j,t}) ≤ √η}.   (9)

Although the resulting optimization problem could be tackled by the same kind of algorithm as the one we will propose, Constraint (8) leads to a simpler implementation.

3.3. OPTIMIZATION ALGORITHM

A standard proximal method for solving Problem 1 is the Douglas-Rachford algorithm (Lions & Mercier, 1979; Combettes & Pesquet, 2007). This algorithm alternates between a proximal step aiming at sparsifying the weight matrix and a projection step allowing a given accuracy to be reached. Assume that a solution to Problem 1 exists. The algorithm uses parameters γ ∈ ]0, +∞[ and (λ_n)_{n∈N} in ]0, 2[ such that ∑_{n∈N} λ_n(2 − λ_n) = +∞, and reads as follows.

Algorithm 1: Douglas-Rachford algorithm for network compression
    Initialize: W̃_0 ∈ R^{N×M} and b̃_0 ∈ R^N
    for n = 0, 1, … do
        W_n = prox_{γh}(W̃_n)
        (Ŵ_n, b̂_n) = proj_C(2W_n − W̃_n, b̃_n)
        W̃_{n+1} = W̃_n + λ_n(Ŵ_n − W_n)
        b̃_{n+1} = b̃_n + λ_n(b̂_n − b̃_n)

The proximity operator of the function γh has a closed form for standard choices of sparsity measures (see http://proximity-operator.net). For example, when h = ‖·‖_1, this operator reduces to a soft-thresholding (with threshold value γ) of the input matrix elements. In turn, since the convex set C has an intricate form, an explicit expression of proj_C does not exist. Finding an efficient method for computing this projection for large datasets thus constitutes the main challenge in the use of the above Douglas-Rachford strategy, which we discuss in the next section.
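The iteration can be sketched on a toy instance. Since proj_C has no closed form (computing it is the subject of Algorithm 2), the sketch below substitutes a Frobenius-ball constraint around the original weights as a stand-in feasibility set; soft-thresholding is the exact proximity operator of γ‖·‖_1. All names are ours, for illustration:

```python
import numpy as np

def soft_threshold(Z, gamma):
    # prox of gamma * ||.||_1 (elementwise soft-thresholding)
    return np.sign(Z) * np.maximum(np.abs(Z) - gamma, 0.0)

def proj_ball(Z, center, radius):
    # Projection onto {Z : ||Z - center||_F <= radius}, a simple
    # stand-in for proj_C (the true projection needs Algorithm 2).
    D = Z - center
    n = np.linalg.norm(D)
    return Z if n <= radius else center + (radius / n) * D

def douglas_rachford(W0, prox_g, proj_c, gamma=0.1, lam=1.5, n_iter=500):
    Wt = W0.copy()
    for _ in range(n_iter):
        W = prox_g(Wt, gamma)      # sparsifying step
        Y = proj_c(2 * W - Wt)     # accuracy (feasibility) step
        Wt = Wt + lam * (Y - W)    # relaxed update
    return prox_g(Wt, gamma)

rng = np.random.default_rng(2)
A = rng.standard_normal((8, 8))
W = douglas_rachford(A, soft_threshold, lambda Z: proj_ball(Z, A, 2.0))
# The iterate stays near the feasibility ball while entries of small
# magnitude are driven exactly to zero by the prox step.
print(np.linalg.norm(W - A), int(np.sum(W == 0)))
```

The fixed relaxation λ_n ≡ 1.5 satisfies the summability condition ∑ λ_n(2 − λ_n) = +∞, so convergence is covered by the cited results.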

3.4. COMPUTATION OF THE PROJECTION ONTO THE CONSTRAINT SET

For every minibatch index j ∈ {1, …, J}, let us define the following convex function:

    (∀(W, b) ∈ R^{N×M} × R^N)  c_j(W, b) = ∑_{t=1}^T d²_{∂f(y_{j,t})}(W x_{j,t} + b − y_{j,t}) − Tη.   (10)

Note that, for every j ∈ {1, …, J}, the function c_j is differentiable and its gradient at (W, b) ∈ R^{N×M} × R^N is given by ∇c_j(W, b) = (∇_W c_j(W, b), ∇_b c_j(W, b)),   (11)
where

    ∇_W c_j(W, b) = 2 ∑_{t=1}^T e_{j,t} x_{j,t}^⊤,  ∇_b c_j(W, b) = 2 ∑_{t=1}^T e_{j,t},   (12)

with

    (∀t ∈ {1, …, T})  e_{j,t} = W x_{j,t} + b − y_{j,t} − proj_{∂f(y_{j,t})}(W x_{j,t} + b − y_{j,t}).   (13)

A pair of weight/bias parameters belongs to C if and only if it lies in the intersection of the 0-lower level sets of the functions (c_j)_{1≤j≤J}. To compute the projection of some (W, b) ∈ R^{N×M} × R^N onto this intersection, we use Algorithm 2 (‖·‖_F denotes here the Frobenius norm). This iterative algorithm has the advantage of proceeding in a minibatch manner. It allows us to choose the minibatch index j_n at iteration n in a quasi-cyclic manner; the simplest rule is to activate each minibatch once within J successive iterations of the algorithm, so that these iterations correspond to an epoch. The proposed algorithm belongs to the family of block-iterative outer approximation schemes for solving constrained quadratic problems introduced in (Combettes, 2003). The convergence of the sequence (W_n, b_n)_{n∈N} generated by Algorithm 2 to proj_C(W, b) is thus guaranteed. One of the main features of the algorithm is that it does not require performing any projection onto the 0-lower level sets of the functions c_j, which would be intractable due to their expressions. Instead, these projections are implicitly replaced by subgradient projections, which are much easier to compute in our context.
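In the ReLU case, the gradients in Eq. (12)-(13) can be evaluated with a few matrix operations. The sketch below also checks that they vanish at the uncompressed parameters, where the inclusion of Section 3.1 holds exactly (names are illustrative):

```python
import numpy as np

def proj_subdiff_relu(y, z):
    # projection onto ∂f(y) for the ReLU case (f = indicator of [0,+inf)^N)
    return np.where(y > 0, 0.0, np.minimum(z, 0.0))

def grad_cj(W, b, X, Y):
    """Gradients of c_j following Eq. (12)-(13); the columns of X and Y
    hold the minibatch pairs (x_t, y_t)."""
    R = W @ X + b[:, None] - Y            # residuals W x_t + b - y_t
    E = R - proj_subdiff_relu(Y, R)       # e_{j,t}, stacked column-wise
    grad_W = 2.0 * E @ X.T                # 2 * sum_t e_t x_t^T
    grad_b = 2.0 * E.sum(axis=1)          # 2 * sum_t e_t
    return grad_W, grad_b

rng = np.random.default_rng(3)
W = rng.standard_normal((4, 3)); b = rng.standard_normal(4)
X = rng.standard_normal((3, 8))
Y = np.maximum(W @ X + b[:, None], 0.0)   # exact layer outputs
gW, gb = grad_cj(W, b, X, Y)
print(np.allclose(gW, 0.0), np.allclose(gb, 0.0))  # -> True True
```

A vanishing gradient at the uncompressed parameters is expected: every e_{j,t} is zero there, which is exactly the feasibility condition defining C.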

3.5. DEALING WITH VARIOUS NONLINEARITIES

For any choice of activation operator R, we have to calculate the projection onto ∂f(y) for every vector y satisfying Eq. (1). This projection is indeed required in the computation of the gradients of the functions (c_j)_{1≤j≤J}, as shown by Eq. (13). For reference, the update steps of Algorithm 2, which use Eq. (12) and Eq. (13) whenever c_{j_n}(W_n, b_n) > 0, read:

    δW_n = c_{j_n}(W_n, b_n) ∇_W c_{j_n}(W_n, b_n) / (‖∇_W c_{j_n}(W_n, b_n)‖²_F + ‖∇_b c_{j_n}(W_n, b_n)‖²)
    δb_n = c_{j_n}(W_n, b_n) ∇_b c_{j_n}(W_n, b_n) / (‖∇_W c_{j_n}(W_n, b_n)‖²_F + ‖∇_b c_{j_n}(W_n, b_n)‖²)
    π_n = tr((W_0 − W_n)^⊤ δW_n) + (b_0 − b_n)^⊤ δb_n
    μ_n = ‖W_0 − W_n‖²_F + ‖b_0 − b_n‖²
    ν_n = ‖δW_n‖²_F + ‖δb_n‖²
    ζ_n = μ_n ν_n − π_n²
    if ζ_n = 0 and π_n ≥ 0 then
        W_{n+1} = W_n − δW_n,  b_{n+1} = b_n − δb_n
    else if ζ_n > 0 and π_n ν_n ≥ ζ_n then
        W_{n+1} = W_0 − (1 + π_n/ν_n) δW_n,  b_{n+1} = b_0 − (1 + π_n/ν_n) δb_n
    else
        W_{n+1} = W_n + (ν_n/ζ_n)(π_n(W_0 − W_n) − μ_n δW_n),  b_{n+1} = b_n + (ν_n/ζ_n)(π_n(b_0 − b_n) − μ_n δb_n)

When c_{j_n}(W_n, b_n) ≤ 0, the iterates are left unchanged: W_{n+1} = W_n and b_{n+1} = b_n.

Two properties may facilitate the calculation of the projection onto ∂f(y). First, if f is differentiable at y, then ∂f(y) reduces to a singleton containing the gradient ∇f(y) of f at y, so that, for every z ∈ R^N, proj_{∂f(y)}(z) = ∇f(y). Second, R is often separable, i.e., it consists of the application of a scalar activation function ρ: R → R to each component of its input argument. According to our assumptions, there thus exists a proper lower-semicontinuous convex function ϕ from R to R ∪ {+∞} such that ρ = prox_ϕ and, for every z = (ζ^(k))_{1≤k≤N} ∈ R^N, f(z) = ∑_{k=1}^N ϕ(ζ^(k)). This implies that, for every z = (ζ^(k))_{1≤k≤N} ∈ R^N, proj_{∂f(y)}(z) = (proj_{∂ϕ(υ^(k))}(ζ^(k)))_{1≤k≤N}, where the components of y are denoted by (υ^(k))_{1≤k≤N}. Based on these properties, a list of standard activation functions ρ is given in Table 1, for which we provide the associated expressions of the projection onto ∂ϕ. The calculations are detailed in Appendix B.
An example of a non-separable activation operator frequently employed in neural network architectures is the softmax operation, defined as

    (∀z = (ζ^(k))_{1≤k≤N} ∈ R^N)  R(z) = (exp(ζ^(k)) / ∑_{j=1}^N exp(ζ^(j)))_{1≤k≤N}.

It is shown in Appendix C that, for every y = (υ^(k))_{1≤k≤N} in the range of R,

    (∀z ∈ R^N)  proj_{∂f(y)}(z) = Q(y) + (1^⊤(z − Q(y)) / N) 1,

where 1 = [1, …, 1]^⊤ ∈ R^N and Q(y) = (ln υ^(k) + 1 − υ^(k))_{1≤k≤N}.
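The softmax projection formula can be verified numerically: for an exact softmax pair (u, y = R(u)), the residual u − y must already lie in ∂f(y), so the projection should leave it unchanged. A small sketch (our function names):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())   # shift for numerical stability
    return e / e.sum()

def proj_subdiff_softmax(y, z):
    """Projection onto ∂f(y) for softmax, following the closed form
    above: Q(y) plus the mean of (z - Q(y)) added to every entry."""
    Q = np.log(y) + 1.0 - y
    return Q + np.mean(z - Q)

rng = np.random.default_rng(4)
u = rng.standard_normal(5)
y = softmax(u)
r = u - y                     # residual of the exact layer
print(np.allclose(proj_subdiff_softmax(y, r), r))  # -> True
```

This works because ∂f(y) = Q(y) + span(1) for softmax, and projecting onto the span of the all-ones vector amounts to averaging.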

3.6. SIS ON MULTI-LAYERED NETWORKS

Algorithm 3 describes how we make use of SIS for a multi-layered neural network. We use a pretrained network and part of the training sequence to extract layer-wise input-output features. Then we apply SIS to each individual layer l by passing η, the layer parameters (W^(l), b^(l)), and the extracted input-output features (Y^(l-1), Y^(l)) to Algorithm 1. The benefit of applying SIS to each layer independently is that all the layers of a network can be processed in parallel. This reduces the time required to process the whole network, and compute resources are optimally utilized.
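The layer-wise decoupling can be sketched as follows. Here `sis_layer` is only a placeholder for Algorithm 1 (it soft-thresholds the weights instead of running the actual SIS iterations), and the forward pass assumes ReLU layers; the point is the parallel, per-layer structure:

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def sis_layer(eta, W, b, Y_in, Y_out):
    """Stand-in for Algorithm 1 (SIS on one layer). Here we merely
    soft-threshold W; the real method solves Problem 1 on (Y_in, Y_out)."""
    gamma = eta  # illustrative coupling of threshold and tolerance
    W_sparse = np.sign(W) * np.maximum(np.abs(W) - gamma, 0.0)
    return W_sparse, b

def parallel_sis(X, weights, biases, eta):
    # Forward pass to collect layer-wise input/output features ...
    feats = [X]
    for W, b in zip(weights, biases):
        feats.append(np.maximum(W @ feats[-1] + b[:, None], 0.0))  # ReLU
    # ... then sparsify every layer independently, in parallel.
    with ThreadPoolExecutor() as pool:
        jobs = [pool.submit(sis_layer, eta, W, b, feats[l], feats[l + 1])
                for l, (W, b) in enumerate(zip(weights, biases))]
    results = [j.result() for j in jobs]
    return [W for W, _ in results], [b for _, b in results]

rng = np.random.default_rng(5)
Ws = [rng.standard_normal((6, 4)), rng.standard_normal((3, 6))]
bs = [rng.standard_normal(6), rng.standard_normal(3)]
Ws2, bs2 = parallel_sis(rng.standard_normal((4, 10)), Ws, bs, eta=0.5)
print([W.shape for W in Ws2])  # -> [(6, 4), (3, 6)]
```

Because each layer only needs its own feature pair, the per-layer jobs share no state and can run concurrently, mirroring the parallelism claimed for Algorithm 3.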

Table 1: Expression of proj_{∂ϕ(υ)}(ζ) for ζ ∈ R and υ in the range of ρ, for standard activation functions ρ; α is a positive constant.

| Name        | ρ(ζ)                                  | proj_{∂ϕ(υ)}(ζ)                                                                 |
| Sigmoid     | (1 + e^{−ζ})^{−1} − 1/2               | ln(υ + 1/2) − ln(1/2 − υ) − υ                                                   |
| Arctangent  | (2/π) arctan(ζ)                       | tan(πυ/2) − υ                                                                   |
| ReLU        | max{ζ, 0}                             | 0 if υ > 0 or ζ ≥ 0; ζ otherwise                                                |
| Leaky ReLU  | ζ if ζ > 0; αζ otherwise              | 0 if υ > 0; (1/α − 1)υ otherwise                                                |
| Capped ReLU | ReLU_α(ζ) = min{max{ζ, 0}, α}         | ζ if (υ = 0 and ζ < 0) or (υ = α and ζ > 0); 0 otherwise                        |
| ELU         | ζ if ζ ≥ 0; α(exp(ζ) − 1) otherwise   | 0 if υ > 0; ln((υ + α)/α) − υ otherwise                                         |
| QuadReLU    | (ζ + α) ReLU_{2α}(ζ + α) / (4α)       | ζ if υ = 0 and ζ ≤ −α; −υ + 2√(αυ) − α if υ ∈ ]0, α] or (υ = 0 and ζ > −α); υ − α otherwise |

Algorithm 3: Parallel SIS for a multi-layered network
    Input: input sequence X ∈ R^{M×K}, compression parameter η > 0, weight matrices W^(1), …, W^(L), and bias vectors b^(1), …, b^(L)
    Y^(0) ← X
    for l = 1, …, L do
        Y^(l) = R(W^(l) Y^(l-1) + b^(l))
        (W^(l), b^(l)) ← SIS(η, W^(l), b^(l), Y^(l), Y^(l-1))
    Output: W^(1), …, W^(L) and b^(1), …, b^(L)
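A few of the Table 1 entries, written as scalar routines (a sketch; υ is assumed to lie in the range of the corresponding ρ, and the names are ours):

```python
import numpy as np

# Scalar projections proj_{∂ϕ(υ)}(ζ) for some rows of Table 1.

def proj_sigmoid(v, z):
    # shifted sigmoid, range ]-1/2, 1/2[; ϕ is differentiable, so the
    # projection is the singleton {ϕ'(v)} and does not depend on z
    return np.log(v + 0.5) - np.log(0.5 - v) - v

def proj_relu(v, z):
    # ∂ϕ(v) = {0} if v > 0, ]-inf, 0] if v = 0
    return 0.0 if (v > 0 or z >= 0) else z

def proj_elu(v, z, alpha=1.0):
    return 0.0 if v > 0 else np.log((v + alpha) / alpha) - v

# Differentiable case: same value whatever ζ is.
print(proj_sigmoid(0.25, -3.0) == proj_sigmoid(0.25, 7.0))  # -> True
# ReLU with υ = 0 and ζ < 0: ζ already lies in ∂ϕ(0) = ]-inf, 0].
print(proj_relu(0.0, -1.3))  # -> -1.3
```

The second print illustrates the general fact used throughout Section 3.5: when the argument already belongs to the subdifferential, the projection returns it unchanged, so e_{j,t} = 0 and that sample contributes nothing to the gradient.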

4. EXPERIMENTS

In this section, we conduct various experiments to validate the effectiveness of SIS in terms of test accuracy vs. sparsity and inference-time FLOPs vs. sparsity by comparing against RigL (Evci et al., 2020). We also include SNIP (Lee et al., 2019), GraSP (Wang et al., 2020), SynFlow (Tanaka et al., 2020), STR (Kusupati et al., 2020), and FORCE (Jorge et al., 2020). These methods start training from a sparse network and have some limitations when compared to methods that prune a pretrained network (Blalock et al., 2020; Gale et al., 2019). For a fair comparison, we also include LRR (Renda et al., 2020), which uses a pretrained network and multiple rounds of pruning and retraining by leveraging learning-rate rewinding. The experimental setup is described in Appendix D.

4.1. MODERN CONVNETS ON CIFAR AND IMAGENET

We compare SIS with competitive baselines on CIFAR-10/100 for three different sparsity regimes (90%, 95%, and 98%), and the results are listed in Table 2. The recovered rows read:

| Method                        | CIFAR-10 (90/95/98%)  | CIFAR-100 (90/95/98%) |
| SynFlow (Tanaka et al., 2020) | 93.35 / 93.45 / 92.24 | 71.77 / 71.72 / 70.94 |
| STR (Kusupati et al., 2020)   | 93.73 / 93.27 / 92.21 | 71.93 / 71.14 / 69.89 |
| FORCE (Jorge et al., 2020)    | 93.87 / 93.30 / 92.25 | 71.9 / 71.73 / 70.96  |
| LRR (Renda et al., 2020)      | 94.03 / 93.53 / 91.73 | 72.12 / 71.36 / 70.39 |
| RigL (Evci et al., 2020)      | 93                    |                       |

Due to its small size and controlled nature, CIFAR-10/100 may not appear sufficient to draw solid conclusions. We thus conduct further experiments on ImageNet using ResNet50 and MobileNets. Table 3 shows that, in the case of ResNet50, LRR performs marginally better than SIS at 60% sparsity. At 80%, 90%, and 96.5% sparsity, SIS outperforms all other methods. For all sparsity regimes, SIS achieves the lowest inference FLOPs, which may be related to the fact that SIS can achieve better compression in the last layer before the softmax. RigL achieves its results in the same training time as SIS with fewer training-time FLOPs. MobileNets are compact architectures designed specifically for resource-constrained devices. Table 4 shows results for RigL and SIS on MobileNets. We observe that SIS outperforms RigL on all MobileNet versions at the 75% sparsity level. At the 90% sparsity level, SIS outperforms RigL for MobileNet V1 and V3, whereas for MobileNetV2 RigL performs slightly better than SIS. In all cases, we can see that the resulting SIS sparse network uses fewer FLOPs than RigL. A possible explanation for this fact is that SIS leverages activation function properties during the sparsification process. Jasper on LibriSpeech. Jasper is a speech recognition model that uses 1D convolutions. The trained network is a 333-million-parameter model with a word error rate (WER) of 12.2 on the test set. We apply SIS to this network and compare it with RigL and SNIP in terms of sparsity. Table 5 reports WER and inference FLOPs for all three methods.
SIS performs marginally better than RigL on this task in terms of WER and FLOPs at 70% sparsity. The main advantage of our approach lies in the fact that we can use a single pre-trained Jasper network and achieve different sparsity levels for different deployment scenarios with fewer computational resources than RigL.


Transformer-XL on WikiText-103. Transformer-XL is a language model with 246 million parameters. The network trained on WikiText-103 has a perplexity score (PPL) of 18.6. In Table 5, we see that SIS performs better than SNIP and RigL in terms of PPL and has 68% fewer inference FLOPs. This is due to the fact that large language models can be efficiently trained and then compressed easily, but training a sparse subnetwork from scratch is hard (Li et al., 2020), as is the case with SNIP and RigL. SNIP uses one-shot pruning to obtain a random sparse subnetwork, whereas RigL is able to change its structure during training, which allows it to perform better than SNIP. N-BEATS on M4. N-BEATS is a very deep residual fully-connected network performing forecasting in univariate time-series problems. It is a 14-million-parameter network. The Symmetric Mean Absolute Percentage Error (SMAPE) of the dense network on the M4 dataset is 8.3%. We apply SIS to this network and compare its performance with that of RigL and SNIP. SIS performs better than both methods and results in 65% fewer inference FLOPs.

5. CONCLUSION

In this article, we have proposed a novel method for sparsifying neural networks. The compression problem for each layer has been recast as the minimization of a sparsity measure under accuracy constraints. This constrained optimization problem has been solved by means of advanced convex optimization tools. The resulting SIS algorithm is i) reliable in terms of iteration convergence guarantees, ii) applicable to a wide range of activation operators, and iii) able to deal with large datasets split into mini-batches. Our numerical tests demonstrate that the approach is not only appealing from a theoretical viewpoint but also practically efficient.

APPENDICES

A EXISTENCE OF A SOLUTION TO PROBLEM 1 UNDER ASSUMPTION 2

Under Assumption 2, Problem 1 is equivalent to

    minimize_{(W,b) ∈ C} h(W)   (15)

with C = {(W, b) ∈ R^{N×M} × R^N | max_{j ∈ {1,…,J}} c_j(W, b) ≤ 0}, where the functions (c_j)_{1≤j≤J} are defined in Eq. (10). These functions being convex, Φ = max_{j ∈ {1,…,J}} c_j is convex (Bauschke & Combettes, 2019, Proposition 8.16), and we deduce that Ψ = inf_{b ∈ R^N} Φ(·, b) is also a convex function (Bauschke & Combettes, 2019, Proposition 8.35). Since Φ ≥ −ηT, Ψ is finite-valued; it is thus continuous on R^{N×M} (Bauschke & Combettes, 2019, Corollary 8.40). Let us now consider the problem

    minimize_{W ∈ lev_{≤0} Ψ} h(W)   (17)

where lev_{≤0} Ψ = {W ∈ R^{N×M} | Ψ(W) ≤ 0} is the 0-lower level set of Ψ. Ψ being both convex and continuous, lev_{≤0} Ψ is closed and convex. According to Assumption 2, there exists (W, b) ∈ R^{N×M} × R^N such that h(W) < +∞ and Φ(W, b) ≤ 0, which implies that Ψ(W) ≤ 0. This shows that lev_{≤0} Ψ has a nonempty intersection with the domain of h. By invoking now the coercivity property of h, the existence of a solution Ŵ to Problem (17) is guaranteed by standard convex analysis results (Bauschke & Combettes, 2019, Theorem 11.10). To show that (Ŵ, b̂) is a solution to (15), it is sufficient to show that there exists b̂ ∈ R^N such that Φ(Ŵ, b̂) = Ψ(Ŵ). This is equivalent to proving that there exists a solution b̂ to the problem: minimize_{b ∈ R^N} Φ(Ŵ, b). We know that Φ(Ŵ, ·) is a continuous function. In addition, we have assumed that there exists (j*, t*) ∈ {1, …, J} × {1, …, T} such that y_{j*,t*} is an interior point of R(R^N), which is also equal to the domain of ∂f and a subset of the domain of f. Since f is continuous on the interior of its domain, ∂f(y_{j*,t*}) is bounded (Bauschke & Combettes, 2019, Proposition 16.17(ii)). Then d_{∂f(y_{j*,t*})} is coercive, hence c_{j*}(Ŵ, ·) is coercive, and so is Φ(Ŵ, ·) ≥ c_{j*}(Ŵ, ·). The existence of b̂ thus follows from the Weierstrass theorem.

B RESULTS IN TABLE 1

The results are derived from the expression of the convex function ϕ associated with each activation function ρ (Combettes & Pesquet, 2020a, Section 2.1) (Combettes & Pesquet, 2020b, Section 3.2).

Sigmoid. For every ζ ∈ R,

    ϕ(ζ) = (ζ + 1/2) ln(ζ + 1/2) + (1/2 − ζ) ln(1/2 − ζ) − (1/2)(ζ² + 1/4) if |ζ| < 1/2;  −1/4 if |ζ| = 1/2;  +∞ if |ζ| > 1/2.

The range of the Sigmoid function is ]−1/2, 1/2[; the above function is differentiable on this interval and its derivative at every υ ∈ ]−1/2, 1/2[ is ϕ'(υ) = ln(υ + 1/2) − ln(1/2 − υ) − υ. We deduce that, for every ζ ∈ R, proj_{∂ϕ(υ)}(ζ) = ϕ'(υ).

For the softmax activation (Appendix C), f can be written as f = ∑_{k=1}^N φ(υ^(k)) + ι_{C∩A}, where ι_{C∩A} denotes the indicator function of the intersection of C and A (equal to 0 on this set and +∞ elsewhere). It then follows from standard subdifferential calculus rules that, for every y = (υ^(k))_{1≤k≤N} ∈ R^N,

    ∂f(y) = (φ'(υ^(k)))_{1≤k≤N} + N_C(y) + N_A(y),   (43)

where φ' is the derivative of φ and N_D denotes the normal cone to a nonempty closed convex set D, defined as N_D(y) = {t ∈ R^N | (∀z ∈ D) ⟨t | z − y⟩ ≤ 0}. Thus N_A(y) = N_V(y) is the orthogonal space V^⊥ of V. Let us now assume that y ∈ R(R^N) ⊂ ]0, 1[^N. Then, since y is an interior point of C, N_C(y) = {0}. We then deduce from Eq. (43) that ∂f(y) = Q(y) + V^⊥, where Q(y) = (φ'(υ^(k)))_{1≤k≤N} = (ln υ^(k) + 1 − υ^(k))_{1≤k≤N}. It follows that, for every z ∈ R^N, proj_{∂f(y)}(z) = Q(y) + proj_{V^⊥}(z − Q(y)). By using the expression of the projection proj_V = Id − proj_{V^⊥} onto the hyperplane V, we finally obtain proj_{∂f(y)}(z) = Q(y) + (1^⊤(z − Q(y)) / N) 1.

D EXPERIMENTAL SETUP

PyTorch is employed to implement our method. We use and extend the publicly available code of SNIP, RigL, LRR, GraSP, SynFlow, STR, and FORCE. In order to manage our experiments, we use Polyaxon on a Kubernetes cluster and use five computing nodes with eight V100 GPUs each. The number of floating-point operations (FLOPs) is computed by counting each multiply-add as one operation. SIS has the following parameters: the number of iterations of Algorithm 1, the number of iterations of Algorithm 2, the step-size parameter γ in Algorithm 1, the constraint bound parameter η used to control the sparsity, and the relaxation parameter λ_n ≡ λ of Algorithm 1. In our experiments, the maximum numbers of iterations of Algorithm 1 and Algorithm 2 are set to 2000 and 1000, respectively. λ is set to 1.5 and γ is set to 0.1 for all the SIS experiments. The value of η depends on the network and dataset; with a few experiments, we search for a good η value that gives a suitable sparsity and accuracy. VGG19 and ResNet50 on CIFAR-10/100. We train VGG19 on CIFAR-10 for 160 epochs with a batch size of 128, a learning rate of 0.1 decayed at epochs 81 and 122, and a weight decay of 5 × 10^-4. A momentum of 0.9 is used with stochastic gradient descent (SGD). We make use of 1000 images per training class when using SIS. We fine-tune the identified sparse subnetwork for 10 epochs at a learning rate of 10^-3. For CIFAR-100, we keep the same training hyperparameters as for CIFAR-10. When applying SIS to the dense network, we use 300 images per class from the training samples. We fine-tune the identified sparse subnetwork for 40 epochs on the training set with a learning rate of 10^-3. ResNet50 employs the same hyperparameters as VGG19, except the weight decay, which we set to 10^-4. When applying SIS to the dense ResNet50, we use the same partial training set and the same hyperparameters during fine-tuning.
In the case of VGG19 on CIFAR-10 and CIFAR-100, we found that η values in the range (1.5, 2) work best for the sparsity range (90%, 98%). In the case of ResNet50, η values in the range (1, 2) are used. The η parameter in our algorithm controls the accuracy tolerance: the higher it is, the more tolerant we are of a loss of precision and the sparser the network becomes. Thus, this parameter also controls the network sparsity, and its choice should be the result of an accuracy-sparsity trade-off. This is illustrated in Figure 2.






In Algorithm 1, the relaxation parameters must satisfy ∑_{n∈N} λ_n(2 − λ_n) = +∞. Throughout this article, proj_S denotes the projection onto a nonempty closed convex set S. Under these conditions, the sequence (W_n, b_n)_{n∈N} generated by Algorithm 1 is guaranteed to converge to a solution to Problem 1 provided there exists (W, b) ∈ C such that W is a point in the relative interior of the domain of h (Combettes & Pesquet, 2007) (see illustrations in Appendix E).

Algorithm 2: Minibatch algorithm for computing proj_C(W, b)
    Initialize: W_0 = W and b_0 = b
    for n = 0, 1, … do
        Select a batch index j_n ∈ {1, …, J}
        if c_{j_n}(W_n, b_n) > 0 then
            compute ∇_W c_{j_n}(W_n, b_n) and ∇_b c_{j_n}(W_n, b_n) by using Eq. (12) and Eq. (13)

Figure 1: Convergence of SLIC: the top row shows the first layer (ReLU-activated) and the bottom row shows the last layer (softmax-activated) in LeNet-FCN. (a) and (d) show the evolution of the maximum value c_max of the constraint functions (c_j)_{1≤j≤J}; (b) and (e) show the evolution of ‖W‖_1 across Algorithm 1 iterations; (c) and (f) show the evolution of ‖W‖_1 across Algorithm 2 iterations.

Figure 2: Effect of η on LeNet-FCN



Table 2: Test accuracy of sparse VGG19 and ResNet50 on the CIFAR-10 and CIFAR-100 datasets.

Table 3: Test top-1 accuracy and inference FLOPs of sparse ResNet50 on ImageNet, where baseline accuracy and inference FLOPs are 77.37% and 4.14G, respectively.

Table 4: Test accuracy and inference FLOPs of sparse MobileNet versions using RigL and SIS on ImageNet; baseline accuracy and inference FLOPs are shown in brackets.

Table 5: Test accuracy and inference FLOPs of Jasper, Transformer-XL, and N-BEATS at 70% sparsity.


Arctangent

By proceeding for this function similarly to the Sigmoid function, we have, for every υ ∈ ρ(R) = ]−1, 1[, proj_{∂ϕ(υ)}(ζ) = ϕ'(υ) = tan(πυ/2) − υ.

ReLU. Here ϕ = ι_{[0,+∞[}. For every υ ∈ ρ(R) = [0, +∞[, we have ∂ϕ(υ) = {0} if υ > 0 and ∂ϕ(0) = ]−∞, 0]. We deduce that proj_{∂ϕ(υ)}(ζ) = 0 if υ > 0 or ζ ≥ 0, and proj_{∂ϕ(υ)}(ζ) = ζ otherwise.

Leaky ReLU. Since the associated function ϕ is differentiable on R, for every υ ∈ R, proj_{∂ϕ(υ)}(ζ) = ϕ'(υ), which equals 0 if υ > 0 and (1/α − 1)υ otherwise.

Capped ReLU. Here ϕ = ι_{[0,α]}. We have thus, for every υ ∈ [0, α], ∂ϕ(υ) = ]−∞, 0] if υ = 0, {0} if 0 < υ < α, and [0, +∞[ if υ = α. This leads to the expression of Table 1.

ELU. This function being differentiable on ρ(R) = ]−α, +∞[, we have, for every υ ∈ ]−α, +∞[, proj_{∂ϕ(υ)}(ζ) = ϕ'(υ) = 0 if υ > 0 and ln((υ + α)/α) − υ otherwise.

QuadReLU. Unlike the previous ones, this function does not seem to have been investigated before. It can be seen as a surrogate for the hard swish activation function, which is not a proximal activation function. The associated ϕ is a lower-semicontinuous convex function whose subdifferential satisfies ∂ϕ(0) = ]−∞, −α] and, for υ > 0, ∂ϕ(υ) = {−υ + 2√(αυ) − α} if υ ∈ ]0, α] and ∂ϕ(υ) = {υ − α} if υ > α. It follows that, for every υ ∈ [0, +∞[, the projection onto ∂ϕ(υ) is given by the expression in Table 1.

C SOFTMAX ACTIVATION

Let C denote the closed hypercube [0, 1]^N, let V be the vector hyperplane V = {z ∈ R^N | 1^⊤z = 0}, and let A be the affine hyperplane A = {z ∈ R^N | 1^⊤z = 1}. For the softmax operator, f can be written as f(y) = ∑_{k=1}^N φ(υ^(k)) + ι_{C∩A}(y), where φ(υ) = υ ln υ − υ²/2 for υ ∈ [0, 1] (with the convention 0 ln 0 = 0). The latter function can be extended to R, say by a quadratic function on ]−∞, 0[, yielding a convex function φ which is differentiable on R.

