LILNETX: LIGHTWEIGHT NETWORKS WITH EXTREME MODEL COMPRESSION AND STRUCTURED SPARSIFICATION

Abstract

We introduce LilNetX, an end-to-end trainable technique for neural networks that enables learning models with specified accuracy-compression-computation tradeoff. Prior works approach these problems one at a time and often require postprocessing or multistage training. Our method, on the other hand, constructs a joint training objective that penalizes the self-information of network parameters in a latent representation space to encourage small model size, while also introducing priors to increase structured sparsity in the parameter space to reduce computation. When compared with existing state-of-the-art model compression methods, we achieve up to 50% smaller model size and 98% model sparsity on ResNet-20 on the CIFAR-10 dataset as well as 31% smaller model size and 81% structured sparsity on ResNet-50 trained on ImageNet while retaining the same accuracy as these methods. The resulting sparsity can improve the inference time by a factor of almost 1.86× in comparison to a dense ResNet-50 model. Code is available at https://github.com/Sharath-girish/LilNetX.

1. INTRODUCTION

[Figure 1: plot of % reduction in FLOPs (y-axis) against reduction in model size (x-axis); labeled baselines include Frankle & Carbin (2018) and Lin et al. (2019).]

Figure 1: Our method jointly optimizes for size on disk and structured sparsity. We compare various approaches using the ResNet-50 architecture on ImageNet and plot FLOPs (y-axis) vs. size (x-axis) for models with similar accuracy. Prior model compression methods optimize for either quantization (■) or pruning (▲) objectives. Our approach, LilNetX, enables training while optimizing for both compression (model size) and computation (structured sparsity). Refer to Table 1 for details.

Recent research in deep neural networks (DNNs) has shown that large performance gains can be achieved on a variety of real-world tasks simply by employing larger, parameter-heavy, and computationally intensive architectures (He et al., 2016; Dosovitskiy et al., 2020). However, as DNNs proliferate in industry, they often need to be trained repeatedly, transmitted over the network to different devices, and made to perform under hardware constraints with minimal loss in accuracy, all at the same time. Hence, finding ways to reduce the storage size of the models on the devices while simultaneously improving their run-time is of utmost importance. This paper proposes a general-purpose neural network training framework to jointly optimize the model parameters for accuracy, model size on disk, and computation, on any given task.

Over the last few years, research on training smaller and more efficient DNNs has followed two seemingly parallel tracks with different goals. One line of work focuses on model compression to deal with storage and communication network bottlenecks when deploying big models or a large number of small models. While these methods achieve high levels of compression in terms of memory, their focus is not on reducing computation.
These works either require additional algorithms with some form of post hoc training (Yeom et al., 2021) or quantize the network parameters at the cost of network performance (Courbariaux et al., 2015; Li et al., 2016).

The other line of work focuses on reducing computation through various model pruning techniques (Han et al., 2015; Frankle & Carbin, 2018; Evci et al., 2020). Their focus is to decrease the number of floating point operations (FLOPs) of the network at inference time, while still achieving some compression due to fewer parameters. Typically, the cost of storing these pruned networks on disk is much higher than that of dedicated model compression works.

In this work, we bridge the gap between the two lines of work and show that it is indeed possible to train a neural network while jointly optimizing for both compression, to reduce disk space, and structured sparsity, to reduce computation (Fig. 1). We maintain quantized latent representations for the model weights and penalize the entropy of these latents. This idea of reparameterized quantization (Oktay et al., 2020) is extremely effective in reducing the effective model size on disk. However, it requires the full dense model during inference. To address this shortcoming, we introduce priors to encourage structured and unstructured sparsity in the representations, along with key design changes. Our priors reside in the latent representation space while encouraging sparsity in the model space. More specifically, we use the notion of slice sparsity, a form of structured sparsity where a K × K slice is fully zero for a convolutional kernel of size K and C channels. Unlike unstructured sparsity, which has irregular memory access and offers little practical speedup, slice-structured sparsity allows for removing entire kernel slices per filter, thus reducing channel size for the convolution of each filter.
Additionally, it is more fine-grained than fully structured channel/filter sparsity works (He et al., 2017; Mao et al., 2017), which typically lead to accuracy drops.

Extensive experimentation on three standard datasets shows that our framework achieves high levels of structured sparsity in the trained models. Additionally, the introduced priors show gains even in model compression compared to the previous state of the art. By varying the weight of the priors, we establish a trade-off between model size, sparsity, and accuracy. Along with model compression, we achieve inference speedups by exploiting the sparsity in the trained models. We dub our method LilNetX: Lightweight Networks with EXtreme Compression and Structured Sparsification. Our contributions are summarized below.

• We introduce LilNetX, an algorithm to jointly perform model compression and structured sparsification for direct computational gains in network inference. Our algorithm can be trained end-to-end using a single joint optimization objective without any post-hoc training or post-processing.

• With extensive ablation studies and results, we show the effectiveness of our approach while outperforming existing approaches in both model compression and pruning in most network and dataset setups, obtaining inference speedups in comparison to the dense baselines.

2. RELATED WORK

Typical model compression methods usually follow some form of quantization, parameter pruning, or both. Both lines of work focus on reducing the size of the model on the disk, and/or increasing the speed of the network during the inference time, while maintaining an acceptable level of classification accuracy. In this section, we discuss prominent quantization and pruning techniques. Model pruning: A plethora of works show that a large number of network weights can be pruned without significant loss in performance (LeCun et al., 1990; Reed, 1993; Han et al., 2015) . Methods such as the Lottery Ticket Hypothesis (Frankle & Carbin, 2018) , adapted by various works (Savarese et al., 2020; Frankle et al., 2019; Malach et al., 2020; Girish et al., 2021; Chen et al., 2021; 2020; Desai et al., 2019; Yu et al., 2020) prune models, while reaching the dense network performance, but are iterative and perform unstructured pruning. Other works prune at initialization (Lee et al., 2018; Wang et al., 2020; Liu & Zenke, 2020; Tanaka et al., 2020) and avoid multiple iterations, but show accuracy drops compared to the dense models (Frankle et al., 2020) . On the other hand, structured sparsity via filter/channel pruning offers practical speedups at the cost of accuracy (Wen et al., 2016; He et al., 2017; Huang & Wang, 2018) . Yuan et al. (2020) obtain almost no drops of network performance with structured sparsification but have lower levels of model compression rates due to storage of floating point weights. Other works operate on intermediate levels of structure such as N:M structured sparsity (Zhou et al., 2021) and block sparsity (Narang et al., 2017) . Niu et al. (2020) is the closest to ours in terms of pruning structure utilizing slice sparsity, along with an even finer pattern pruning. They show that such structure can be exploited for inference speedups. They, however, require predefining a filter pattern set and heuristics for determining layerwise sparsity. 
They also optimize for auxiliary variables and have additional training costs due to the dual optimization subproblem (Ren et al., 2019) . In contrast, our algorithm uses a single objective to jointly optimize for sparsity and model compression with very little impact on training complexity.

[Figure 2 diagram; panel labels: Image, Standard CNN, Logits, Quantized latent representations, Decoder, Model Weights, Sparse latents, Sparse weights, End-to-end joint optimization.]

Figure 2: Overview of our approach. A standard CNN comprises a sequence of convolutional and fully-connected layers. We reparameterize the parameters $W_i$ of each of these layers as $\widetilde{W}_i$ in a quantized latent space. Our decoder is such that sparsity in the quantized latents translates to sparsity in the CNN parameters. Further, we organize each parameter tensor as a set of slices (depicted as colored bands) corresponding to different channels. The proposed training loss exploits this structure to encourage slice sparsity and jointly optimize for accuracy-compression-computation.

Model quantization: Quantization methods discretize the parameters of a network to a small, finite set of values, to store them efficiently using entropy coding methods (Rissanen & Langdon, 1981). Earlier methods uniformly quantize weights to binary or ternary representations (Courbariaux et al., 2015; Li et al., 2016; Zhou et al., 2018; Zhu et al., 2016; Rastegari et al., 2016; Hubara et al., 2017). Several other works focus on non-uniform Scalar Quantization (SQ) techniques (Tung & Mori, 2018; Zhou et al., 2017; Nagel et al., 2019; Banner et al., 2018; Wu et al., 2016; Zhang et al., 2018; Oktay et al., 2020). Vector Quantization (VQ) (Gong et al., 2014; Stock et al., 2019; Wang et al., 2016; Chen et al., 2015; 2016), on the other hand, is a more general technique, where the representers can take any value. VQ can be done by clustering of CNN layers at various compression-accuracy trade-offs (Faraone et al., 2018; Son et al., 2018), hashing (Chen et al., 2015; 2016), or residual quantization (Gong et al., 2014). Quantization works focus on reducing bit widths (Zhao et al., 2019; Jain et al., 2020), which has the effect of high model compression (Young et al., 2021). A few works do provide inference speedups (Hubara et al., 2016; Jacob et al., 2018) by utilizing lower-bit arithmetic, but they require custom hardware.
In comparison, we focus on jointly optimizing for compression and computational gains, and leave optimization for lower-precision arithmetic to future work. In §5, we show the benefits of our approach in terms of even smaller model size compared to these quantization works, along with computational gains, while maintaining high levels of accuracy.

3. APPROACH

We consider the task of classification using a convolutional neural network (CNN), although our approach can be trivially extended to other tasks such as object detection or generative modeling. Given a dataset of $N$ images and their corresponding labels $\{x_i, y_i\}_{i=1}^{N}$, our goal is to train a CNN with parameters $\Theta$ that are jointly optimized to: 1) maximize classification accuracy, 2) compress the model by minimizing the number of bits required to store the model on disk, and 3) minimize the computational cost of inference in the model by maximizing model sparsity. To keep our method end-to-end trainable, we formulate it as minimization of a joint objective that allows for an accuracy-compression-computation trade-off as

$$L(\Theta) = L_{acc}(\Theta) + L_{compress}(\Theta) + L_{compute}(\Theta). \tag{1}$$

For the task of classification, the accuracy term $L_{acc}(\Theta)$ is the usual cross-entropy loss. Given an image, it maximizes the probability assigned to the target label. The compression term $L_{compress}(\Theta)$ encourages the model to have a small disk size. We reparameterize the model weights using a quantized latent representation space; these latents are then stored on disk. The compression term, while encouraging smaller model size, doesn't lead to computational gains, as the decoded model parameters are still dense. Our computation term $L_{compute}(\Theta)$ addresses this issue by introducing a structured sparsity-inducing loss. Our framework allows the structured sparsity in latent weight space to directly translate to structured sparsity in the decoded model weights. In our experiments, we demonstrate significant speedups in the inference time of our model with sparse weights using off-the-shelf libraries. Refer to Fig. 2 for a high-level overview of our approach. In the following sections, we describe the compression and computation terms in more detail.

3.1. COMPRESSION TERM

We formulate our compression term by building upon prior works that incorporate an entropy penalty on the parameter representation during training (Ballé et al., 2018; Oktay et al., 2020). The model's weights and biases, $\Theta$, are reparameterized as quantized latent representations, which are compressed, while the network parameters are implicitly defined as a transform of the latent representations.

We represent the set of model parameters as $\Theta = \{W_1, b_1, W_2, b_2, \ldots, W_N, b_N\}$, where $N$ is the total number of layers in the network, and $W_k$, $b_k$ represent the weight and bias parameters of the $k$-th layer. Each of these parameters can take continuous values during inference. However, these parameters are stored using quantized latent representations belonging to a corresponding set $\Phi = \{\widetilde{W}_1, \widetilde{b}_1, \widetilde{W}_2, \widetilde{b}_2, \ldots, \widetilde{W}_N, \widetilde{b}_N\}$. For each convolutional layer, $W$ is a weight tensor of dimensions $C_{in} \times C_{out} \times K \times K$, where $C_{in}$ is the number of input channels, $C_{out}$ is the number of output channels, and $K$ denotes the filter width and height. The corresponding quantized latent representation $\widetilde{W}$ is represented by a two-dimensional (2D) matrix of size $C_{in} C_{out} \times K^2$. For each dense layer, $W$ is a tensor of dimension $C_{in} \times C_{out}$ and its corresponding latent representation $\widetilde{W}$ is a matrix of dimension $C_{in} C_{out} \times 1$. All the biases can be represented in the same way as dense layers. Each latent representation is a quantized 2D matrix $\widetilde{W} \in \mathbb{Z}^{C_{in} C_{out} \times l}$, where $l = 1$ for dense weights (and biases) and $l = K^2$ for convolutional weights. Each row/slice $\widetilde{W}_i$ from $\widetilde{W}$ represents a sample drawn from an $l$-dimensional discrete probability distribution. In order to decode parameters from the latent space $\Phi$ to the model space $\Theta$, learnable affine transforms $\Psi$ are introduced:

$$W_i = \Psi_{scale} \widetilde{W}_i + \Psi_{shift}, \qquad W = \mathrm{reshape}([W_1 \ldots W_i \ldots]), \qquad i \in [1 \ldots C_{in} C_{out}] \tag{2}$$

where $\widetilde{W}_i$ represents the $i$-th row/slice of $\widetilde{W}$ and $W_i$ is the corresponding decoded slice.
$\Psi_{scale} \in \mathbb{R}^{l \times l}$ and $\Psi_{shift} \in \mathbb{R}^{l}$ are the affine transformation parameters. Different kinds of layers use different pairs of transform parameters $(\Psi_{scale}, \Psi_{shift})$, i.e., different convolutional layers have their own transform, dense layers have their own, and so on.

As $\Phi$ consists of discrete parameters, which are difficult to optimize, continuous surrogates $\widehat{W}$ are maintained for each quantized parameter $\widetilde{W}$. $\widetilde{W}$ is thus simply obtained by rounding the elements of $\widehat{W}$ to the nearest integer. A straight-through estimator (Bengio et al., 2013) is used to backpropagate the gradients from the classification loss to $\widehat{W}$. Bit-rate minimization is achieved by enforcing an entropy penalty on the surrogates $\widehat{W}$. For a given surrogate $\widehat{W} \in \mathbb{R}^{d \times l}$ with $d$ samples of dimension $l$, we add uniform noise $n \sim \mathcal{U}(-\tfrac{1}{2}, \tfrac{1}{2})$ and fit $l$ probability models $\{q_j, j \in [1 \ldots l]\}$ as proposed in Ballé et al. (2018). The entropy of model weights can now be minimized directly by minimizing the negative log-likelihood, which serves as an approximation to the self-information $I$ (Eq. (3)). The compression term is then the sum of all the self-information terms:

$$q(\widehat{W}) = \prod_{i=1}^{C_{in} C_{out}} \prod_{j=1}^{l} q_j(\widehat{W}_{i,j} + n_{i,j}), \qquad I(\widehat{W}) \approx -\log_2 q(\widehat{W}), \qquad L_{compress}(\Theta) = \lambda_I \sum_{\phi \in \Phi} I(\widehat{\phi}) \tag{3}$$

where $\lambda_I$ is a hyper-parameter specifying the relative weight of the compression loss and $\widehat{\phi}$ is the continuous surrogate of the latent $\phi$. After training, only the quantized representations $\widetilde{W}$ are stored, by arithmetic coding using the probability tables from $q_j$. We discard the continuous surrogates after training. During inference, we load the quantized latents $\widetilde{W}$ and decode them using the learnt decoder to obtain the continuous model weights $W$, which are used in the model's forward pass.
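The store-then-decode pipeline described above can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: the function names are hypothetical, the per-column empirical histogram is an assumption standing in for the learned probability models $q_j$, and no arithmetic coder is included. It only shows rounding surrogates to integer latents, estimating their self-information in bits, and decoding rows with the affine transform of Eq. (2).

```python
import math
from collections import Counter


def quantize(surrogate):
    """Round each continuous surrogate value to the nearest integer latent.

    Python's round() resolves exact .5 ties to the even integer.
    """
    return [[int(round(v)) for v in row] for row in surrogate]


def empirical_bits(latent):
    """Total self-information (bits) of the latents.

    Assumption: a per-column empirical histogram replaces the learned
    probability models q_j used in the paper.
    """
    n_rows, n_cols = len(latent), len(latent[0])
    total = 0.0
    for j in range(n_cols):
        counts = Counter(row[j] for row in latent)
        for row in latent:
            total += -math.log2(counts[row[j]] / n_rows)
    return total


def decode(latent, psi_scale, psi_shift):
    """Decode every latent row: W_i = psi_scale @ latent_i + psi_shift."""
    l = len(psi_shift)
    return [
        [sum(psi_scale[r][c] * row[c] for c in range(l)) + psi_shift[r]
         for r in range(l)]
        for row in latent
    ]


if __name__ == "__main__":
    surrogate = [[0.2, -1.4], [0.1, 1.6], [-0.3, -0.7], [0.4, 2.2]]
    latent = quantize(surrogate)
    print(latent)
    print(empirical_bits(latent))
    print(decode(latent, [[0.5, 0.0], [0.0, 0.25]], [0.1, 0.0]))
```

In this toy example the first column is constant and therefore costs no bits, while the second column carries all of the information, which is the behaviour the entropy penalty is meant to exploit.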

3.2. SPARSITY PRIORS

The compression term described in the previous section encourages a smaller representation of the model in terms of the number of bits required to store the model on disk. However, once the model is decoded for inference, it can be fully dense, with no reduction in computation in a single forward pass of the model. To address this, we introduce a few key changes and then formulate our computation term as structured sparsity priors that lead to reduced computation. We formulate all the priors in the latent representation space to decouple them from the affine transform parameters $\Psi_{scale}$, $\Psi_{shift}$ and to be consistent with the $L_{compress}(\Theta)$ term, which is also applied in the same space.

We observe from Eq. (2) that even if all the elements of the latent $\widetilde{W}_i$ are zero, the resulting transformed slice $W_i$ may still be non-zero. In order to enforce structural sparsity in the parameters in the model space $W$, we require each $K \times K$ slice $W_i$ to be $\mathbf{0}$, the zero vector. However, this is only possible if $(\Psi_{shift} + \widetilde{W}_i)$ is a zero vector itself or lies in the null space of $\Psi_{scale}$. We notice that the latter does not occur in most practical situations, especially when the vector $\widetilde{W}_i$ is discrete. Therefore, we remove the shift parameter $\Psi_{shift}$ and make the affine transform a purely linear transform:

$$W_i = \Psi_{scale} \widetilde{W}_i, \qquad \widetilde{W}_i \in \mathbb{Z}^{l}, \qquad \Psi_{scale} \in \mathbb{R}^{l \times l} \tag{4}$$

Note that the $j$-th element of $W_i$ is zero only if the $j$-th row of $\Psi_{scale}$ is orthogonal to $\widetilde{W}_i$ or $\widetilde{W}_i = \mathbf{0}$. The former tends to be rare or nonexistent in practice due to $\Psi_{scale}$ being real-valued and $\widetilde{W}_i$ being discrete. Thus, any single non-zero element in $\widetilde{W}_i$ causes the transformed model vector $W_i$ to be non-zero and does not yield any sparsity in the model space. Loosely, we get $W_i = \mathbf{0} \Leftrightarrow \widetilde{W}_i = \mathbf{0}$.

Unstructured sparsity of latents with Gaussian prior: Since a sparse model contains a majority of zeros, its weight distribution should peak at zero. However, the loss in Eq. (3) does not necessarily enforce zero latents, allowing for any non-zero constant value. A zero-mean Gaussian is one such distribution that enforces zero-centered latents. It also enforces other useful properties such as unimodality and symmetry, typically observed in trained uncompressed model weight distributions (Appendix B). Similar to Eq. (3), a Gaussian prior can be viewed as a compression penalty but with a Gaussian weight distribution. This corresponds to the $l_2$ norm penalty on $\widetilde{W}$:

$$q_U(\widetilde{W}) = \prod_{i=1}^{C_{in} C_{out}} \prod_{j=1}^{l} \frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{1}{2} \widetilde{W}_{i,j}^2} \quad \text{and} \quad I_U(\widetilde{W}) \approx -\log_2 q_U(\widetilde{W}) \tag{5}$$

$$L_{gaussian}(\Theta) = \lambda_U \sum_{\widetilde{W} \in \Phi} I_U(\widetilde{W}) = \lambda_U \sum_{\widetilde{W} \in \Phi} \sum_{i=1}^{C_{in} C_{out}} \sum_{j=1}^{l} \lVert \widetilde{W}_{i,j} \rVert_2^2 \tag{6}$$

where the second equality holds up to additive and multiplicative constants, and $\lambda_U$ is the trade-off parameter controlling how closely the weights follow a Gaussian distribution compared to the fully factorized distribution from the probability models in Eq. (3). A Laplacian prior, resulting in an $l_1$ penalty, can also be used, but the difference between the two distributions is negligible due to the effects of the quantization on $\widetilde{W}$. We experimented with both priors and observed similar performance, as expected (refer to Appendix G). Through experiments in Sec. 4, we show that this prior not only enforces model sparsity by design but also improves model compression.

Structured sparsity of latents with group lasso: The Gaussian prior encourages individual weight values in the latents to be close to zero. However, from Eq. (4), we see that entire slices $\widetilde{W}_i$ should be zero vectors to improve sparsity in the model space. Note that each slice $\widetilde{W}_i$ can be represented as a group belonging to $\widetilde{W}$, the set of groups (or slices). Thus, to drive individual slices to zero as a whole, we propose to use a group sparsity regularization (Yuan & Lin, 2006) on each slice $\widetilde{W}_i$ as follows:

$$L_{group}(\Theta) = \lambda_S \sum_{\widetilde{W} \in \Phi} \sum_{i=1}^{C_{in} \times C_{out}} \sqrt{\rho}\, \lVert \widetilde{W}_i \rVert_2 \tag{7}$$

where $\rho$ accounts for varying group sizes and $\lambda_S$ is the trade-off parameter for the structured sparsity.
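As a rough illustration of the two priors and of the shift-free decoder, the following pure-Python sketch evaluates the unweighted penalties on a small integer latent matrix. The function names are hypothetical, and setting $\rho$ equal to the slice length is an assumption (the usual group lasso convention), since the text only says that $\rho$ accounts for group size.

```python
import math


def decode_linear(latent, psi_scale):
    """Shift-free decoder: W_i = psi_scale @ latent_i for every latent row."""
    l = len(psi_scale)
    return [
        [sum(psi_scale[r][c] * row[c] for c in range(l)) for r in range(l)]
        for row in latent
    ]


def gaussian_penalty(latent):
    """Unweighted squared-l2 penalty: sum of squared latent entries."""
    return sum(v * v for row in latent for v in row)


def group_lasso_penalty(latent):
    """Unweighted group penalty: sum over slices of sqrt(rho) * ||slice||_2.

    Assumption: rho is taken to be the number of entries in the slice.
    """
    return sum(
        math.sqrt(len(row)) * math.sqrt(sum(v * v for v in row))
        for row in latent
    )


if __name__ == "__main__":
    # Three slices of length 4; only the middle one is non-zero.
    latent = [[0, 0, 0, 0], [3, 0, -4, 0], [0, 0, 0, 0]]
    psi_scale = [
        [0.7, -0.2, 0.1, 0.0],
        [0.3, 0.9, -0.5, 0.2],
        [-0.1, 0.4, 0.6, 0.8],
        [0.5, -0.3, 0.2, 1.1],
    ]
    decoded = decode_linear(latent, psi_scale)
    print(decoded[0])   # a zero latent slice decodes to a zero slice
    print(gaussian_penalty(latent), group_lasso_penalty(latent))
```

Because the decoder has no shift, an all-zero latent slice always decodes to an all-zero weight slice, so a penalty that zeroes whole latent rows removes whole slices from the convolution.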

3.3. JOINT OPTIMIZATION OBJECTIVE

The overall loss function is the combination of the cross-entropy loss (for classification), the self-information, and the regularization for structured and unstructured sparsity of the latents, as follows:

$$L(\Theta) = \underbrace{L_{acc}(\Theta)}_{\text{Accuracy}} + \underbrace{\lambda_I \sum_{\phi \in \Phi} I(\widehat{\phi})}_{\text{Compression}} + \underbrace{\lambda_U \sum_{\widetilde{W} \in \Phi} \sum_{i=1}^{C_{in} C_{out}} \sum_{j=1}^{l} \lVert \widetilde{W}_{i,j} \rVert_2^2}_{\text{Unstructured Sparsity}} + \underbrace{\lambda_S \sum_{\widetilde{W} \in \Phi} \sum_{i=1}^{C_{in} \times C_{out}} \sqrt{\rho}\, \lVert \widetilde{W}_i \rVert_2}_{\text{Structured Sparsity}} \tag{8}$$

The above objective is fully differentiable and can be minimized end-to-end by any gradient-based optimizer. Note that while there are three trade-off parameters, each of them is relatively independent and intuitively controls a different aspect of the network: $\lambda_I$ controls model size, $\lambda_U$ enforces the zero-mean prior necessary for sparsity, and $\lambda_S$ controls slice sparsity and hence computation during network inference.

$l_2$ weight decay on model weights is typically used as a regularization for improving generalization. However, our motivation for the Gaussian prior is the objective detailed in Sec. 3.2. Furthermore, our group lasso prior and Gaussian prior are applied in the latent representation space rather than in the model space as in Wen et al. (2016). Due to the presence of quantization and decoding, the effect of these priors on training is significantly different from that of applying them directly to the model weights. We analyze the effect of these priors in Sec. 4.

Datasets.

We consider three datasets in our experiments. The CIFAR-10 and CIFAR-100 datasets (Krizhevsky et al., 2009) consist of 50000 training and 10000 test color images, each of size 32 × 32. For large-scale experiments, we use the ILSVRC2012 (ImageNet) dataset (Deng et al., 2009). It has 1.2 million images for training, 50000 images for testing, and 1000 classes.

Network Architectures. For the CIFAR-10/100 datasets, we show results using VGG-16 (Simonyan & Zisserman, 2014) and ResNet-20 with a width multiplier of 4 (ResNet-20-4) (He et al., 2016). VGG-16 is a commonly used architecture consisting of 13 convolutional layers of kernel size 3 × 3 and 3 dense or fully-connected layers. Dense layers are resized to adapt to CIFAR's 32 × 32 image size, as done in baseline approaches. ResNet-20-4 consists of 3 ResNet groups, each with 3 residual blocks. All convolutional layers are of size 3 × 3, along with a final dense layer. For the ImageNet experiments, we use ResNet-18/50 networks with one 7 × 7 convolutional layer, multiple 3 × 3 and 1 × 1 convolutional layers, and the final dense layer. We also run experiments with MobileNet-V2 (Sandler et al., 2018), containing depthwise separable convolutions and inverted bottleneck blocks.

We use the Adam optimizer (Kingma & Ba, 2014) for updating all parameters of our models. The entropy model parameters are optimized with a learning rate of $10^{-4}$ for all our experiments. The remaining parameters are optimized with a learning rate of 0.01 for CIFAR-10 experiments and a learning rate of 0.02 for ResNet-18/50 on ImageNet with a cyclic schedule. Our model compression results are reported using the torchac library (Mentzer et al., 2019), which performs arithmetic coding of the weights given probability tables for the quantized values; we obtain these tables from the probability models.
We do not compress the biases and batch normalization (BN) parameters, and we include the additional sizes from these parameters as well as the parameters $\Psi_{scale}$ when reporting model size. Parameter groups for different networks, based on types of convolutional/dense layers, are provided in Appendix E.

Figure 3: Effect of the unstructured sparsity coefficient $\lambda_U$. We show plots for 3 values of $\lambda_U$, varying $\lambda_S$ to obtain the Pareto curves on CIFAR-10. Higher $\lambda_U$ improves the accuracy-model size trade-off (a, left) by a good margin and also the accuracy-slice sparsity trade-off (b, right) by a small amount. Shaded areas represent the confidence intervals of the regression fit. Best viewed in color.

4. ANALYSIS

Recall from Eq. (8) that we proposed two sparsity coefficients: $\lambda_U$ for unstructured sparsity and $\lambda_S$ for structured sparsity. Note that the sparsity terms not only improve the model's inference speed (by increasing slice sparsity and reducing the number of FLOPs) but also reduce the entropy of the latent weights $\widetilde{W}$, as most of the weights become zero after quantization. By varying the two sparsity coefficients, one for each sparsity term, we obtain different points on the Pareto curves for the accuracy vs. model size trade-off and the accuracy vs. slice sparsity trade-off. In this section, we study each of these trade-offs extensively. We use slice sparsity as a proxy for computational complexity or inference speed in this section for brevity, and revisit computational complexity in terms of actual wall-clock inference time in Sec. 5.2. We use a constant compression coefficient $\lambda_I = 10^{-4}$. We show Pareto curves of model performance (accuracy) vs. model size (bit-rate) and vs. slice sparsity (%), keeping one of $\lambda_U$, $\lambda_S$ fixed and varying the other. We analyze model sizes for the compressed parameters in this section, as the remaining parameter sizes are constant. All results in this section are obtained by averaging over 3 runs with varying random seeds.
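The two sparsity measures used throughout this analysis can be sketched as follows. This is an illustrative pure-Python helper with hypothetical function names, assuming a latent matrix stored as a list of rows where each row is one slice; it is not taken from the released code.

```python
def unstructured_sparsity(latent):
    """Fraction of individual latent entries that are exactly zero."""
    total = sum(len(row) for row in latent)
    zeros = sum(1 for row in latent for v in row if v == 0)
    return zeros / total


def slice_sparsity(latent):
    """Fraction of slices (rows) whose entries are all zero."""
    zero_slices = sum(1 for row in latent if all(v == 0 for v in row))
    return zero_slices / len(latent)


if __name__ == "__main__":
    # Same number of zeros, arranged differently.
    scattered = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]]
    grouped = [[1, 1, 1, 1], [0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0]]
    for name, lat in (("scattered", scattered), ("grouped", grouped)):
        print(name, unstructured_sparsity(lat), slice_sparsity(lat))
```

Both toy matrices have the same unstructured sparsity, yet only the grouped one has any all-zero slices, which is why slice sparsity, rather than the raw count of zeros, serves as the proxy for computation.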

4.1. EFFECT OF UNSTRUCTURED SPARSITY REGULARIZATION

Fig. 3 shows model performance for three different values of $\lambda_U$, while varying $\lambda_S$ to obtain the Pareto trade-off curve for each case. Fig. 3(a) and 3(b) show the impact on the top-1 accuracy as a function of model size and slice sparsity, respectively. We see that increasing $\lambda_U$ (blue circles to orange crosses to green squares) improves the accuracy vs. model size trade-off curve by a fair margin while also slightly improving the accuracy vs. slice sparsity curve. This is to be expected, as a higher $\lambda_U$ leads to a higher number of zeros in the latent representations, leading to lower entropy and consequently lower model size. We also see a marginal gain in slice sparsity even though the Gaussian prior doesn't directly optimize for it, as a higher number of zeros eventually leads to more slices going entirely to zero. Thus, we see that the Gaussian prior helps with improving model size while also marginally improving slice sparsity.

This shows that the group lasso regularization is effective in improving the slice sparsity of the model, which is necessary for computational benefits. Additionally, we see a small improvement in the accuracy-model size curve, as a higher $\lambda_S$ promotes a higher number of zero slices, which indirectly leads to lower entropy of the overall latent representations.

4.3. STRUCTURED vs. UNSTRUCTURED SPARSITY REGULARIZATION

While both structured and unstructured sparsity regularization help improve slice sparsity, the latter optimizes for it indirectly. Fig. 5 shows the effect of $\lambda_U$, $\lambda_S$ on the unstructured and slice sparsity of the latent representations. As expected, we observe that increasing $\lambda_U$ (corresponding to larger points), for a fixed $\lambda_S$, increases the sparsity of the latent representations while indirectly increasing slice sparsity as well. However, as we increase $\lambda_S$, we notice that the plots shift upwards, implying higher structural sparsity for any fixed value of unstructured sparsity. Therefore, we see that our structured sparsity prior is effective in forcing non-zero weights to lie in fewer weight slices, thus leading to higher structured sparsity and ultimately to speedups in model inference. Note that while the structured sparsity constraint directly optimizes for slice sparsity, the unstructured sparsity prior promotes sparsity within a group/slice, akin to Simon et al. (2013). Unstructured sparsity has a strong effect on model size due to a large number of zeros and lower entropy, as we show in Sec. 4.1, while the structured sparsity constraint directly affects slice sparsity, as shown above and in Sec. 4.2.

Table 1: Comparison of our approach against other model compression techniques. We show two cases of our method: Best and Extreme. Best corresponds to our best model in terms of error rate and compression factor, while Extreme matches the error range of the baselines, where such a range exists. We achieve higher compression along with the added computational benefits of high slice sparsity. "-" implies that the work performs pruning but does not report numbers in the paper.

All our models are trained from scratch. CIFAR-10/100 experiments are trained for 200 epochs. We use the FFCV library (Leclerc et al., 2022) for faster ImageNet training, with a batch size of 512 for ResNet-18/50 split across 4 GPUs.
We train ResNet-18/50 for 35 epochs to keep the range of the uncompressed network accuracies similar to other works for a fair comparison. We show strong performance in terms of model compression and sparsity outperforming existing model compression works while converging faster with relatively fewer epochs.

5.1. COMPARISON WITH COMPRESSION METHODS

As discussed in Sec. 2, existing approaches for model compression follow either quantization, pruning, or both. We compare with the state-of-the-art methods in each of these two categories. Among model quantization methods, we use Oktay et al. (2020) and Young et al. (2021) for comparison, with the latter offering speedups via mixed-precision inference, albeit on specialized hardware. Our results are summarized in Table 1. Unless otherwise noted, we use the numbers reported by the original papers. Since we do not have access to many prior art models, we compare using slice sparsity.

For the CIFAR-10 dataset, we achieve the best results in compression while also achieving a lower top-1 error rate for both VGG-16 and ResNet-20-4. For VGG-16, we obtain the best performance in the error range of ∼7% at 129KB, which is a 465× compression compared to the baseline model. At the ∼10% error range, we outperform Oktay et al. (2020) in terms of model compression while also reaching 99.2% slice sparsity. For ResNet-20-4, compared to Oktay et al. (2020), we achieve almost twice the compression rate at a similar error rate of ∼8.5%, simultaneously achieving extremely high levels of slice sparsity (97.9%). Similar results hold for CIFAR-100, where we achieve a 137× compression in model size with 86.7% slice sparsity and little to no drop in accuracy compared to the uncompressed model.

For ResNet-18 trained on ImageNet, we achieve 30× compression compared to the uncompressed model with an almost equal error rate, outperforming Oktay et al. (2020). The compressed network achieves a sparsity of 33.3%. For an extreme compression case, we achieve higher levels of compression (54×) and sparsity (∼65%) at the cost of ∼2% accuracy compared to the uncompressed model. For ResNet-50, our best model achieves a compression rate of 26×, along with 66.7% slice sparsity.
An extreme case of our model achieves a higher compression rate of 34× with a sparsity of 81.7%, compared to the next best work of Young et al. (2021) with a rate of 21× at a similar error rate. We provide additional baseline comparisons for ResNet-50 in the supplementary material. Finally, for MobileNet-V2, which is already lightweight and optimized for computational efficiency, we still achieve 21× compression compared to the dense model along with 56.8% slice sparsity, with almost no drop in accuracy. This is especially beneficial for MobileNets, which consist of depthwise separable convolutions, as removing entire 2D slices of the convolutional weight allows for directly removing the corresponding channels of the input activation map, leading to lower FLOPs. We outperform other baselines at a similar accuracy in terms of both model compression and sparsity. We conclude that our framework outperforms state-of-the-art (SOTA) approaches in model compression by a fair margin while achieving network weight sparsification for computational gains.

Instead of entropy coding, the sparse matrices can alternatively be compressed using sparse matrix formats. We choose the two popular formats of Compressed Sparse Row (CSR) and Coordinate Format (COO). Results are summarized in Table 3 for our best run for ResNet-50, shown in Table 1 in the main paper. We see that entropy coding far outperforms the sparse formats of CSR and COO, with COO obtaining better compression rates than CSR. This is expected, as CSR/COO achieve high levels of compression only with extremely high levels of sparsity. With an unstructured sparsity level of ∼80%, storing only the non-zero weights themselves (and not their indices) provides a maximum compression of 5×.
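The 5× figure follows from simple arithmetic: at sparsity $s$, only a fraction $1 - s$ of the values remains, so the best possible ratio is $1/(1 - s)$. The sketch below works this out and adds a rough estimate of what happens once positions must be stored too. The function names, the 32-bit value width, and the 64 bits of index information per non-zero (for example a row and a column index in a COO-like layout) are illustrative assumptions, not measurements from the paper.

```python
def values_only_ratio(sparsity):
    """Upper bound on compression when only non-zero values are stored."""
    return 1.0 / (1.0 - sparsity)


def indexed_ratio(sparsity, value_bits=32, index_bits_per_nnz=64):
    """Compression ratio when each non-zero also stores position bits.

    Assumption: index_bits_per_nnz bits of index data per non-zero entry,
    with any fixed per-matrix overhead ignored.
    """
    kept = 1.0 - sparsity
    return value_bits / (kept * (value_bits + index_bits_per_nnz))


if __name__ == "__main__":
    for s in (0.8, 0.95, 0.99):
        print(s, values_only_ratio(s), indexed_ratio(s))
```

Under these assumptions the index overhead pulls the ratio at 80% sparsity well below the 5× bound, and the ratio only becomes large at very high sparsity, in line with the observation above about sparse formats.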

E PARAMETER GROUPS FOR VARIOUS NETWORKS

We share weight decoders and probability models within each parameter group of a network, as the weights in a group can be seen as being drawn from similar weight distributions. This limits the overhead of storing the weights of the corresponding decoders. We list the types of parameter groups for each network as follows:

G $\ell_2$ VS. $\ell_1$ VS. $\ell_\infty$ NORM

In this section, we analyze the effect of different types of norm for both individual weights and groups. For individual weights, we compare the $\ell_2$ norm with the $\ell_1$ norm, while for the group norm, we compare the $\ell_2$ norm with the $\ell_\infty$ norm (the $\ell_1$ weight norm is the same as the $\ell_1$ group norm, since both are sums of absolute values). Results are summarized in Fig. 11, where the top and bottom rows are for CIFAR-10 and CIFAR-100 respectively. We see that the $\ell_2$ group norm outperforms its $\ell_\infty$ counterpart for both datasets. The $\ell_1$ weight norm, however, has little additional effect compared to the $\ell_2$ weight norm. Additionally, the $\ell_2$ group norm yields lower slice sparsity for a given sparsity (c,g), highlighting the importance of $\ell_\infty$ for high structured sparsity. While $\ell_\infty$ leads to higher sparsity, it also shows a higher model size for a given slice sparsity. Thus, there is an inherent tradeoff for $\ell_\infty$, which leads to more sparsity but also larger model sizes (d,h).
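The norms compared above can be written down concretely. The following is a minimal sketch, assuming each row of a 2D array holds one flattened slice (group); the function names and the toy values are hypothetical and are not taken from the released code.

```python
import numpy as np


def weight_l1(slices):
    """l1 norm over individual weights."""
    return float(np.abs(slices).sum())


def group_l1(slices):
    """Sum of per-slice l1 norms; identical to the weight l1 norm."""
    return float(np.abs(slices).sum(axis=1).sum())


def group_l2(slices):
    """Sum of per-slice l2 norms (group lasso)."""
    return float(np.sqrt((slices ** 2).sum(axis=1)).sum())


def group_linf(slices):
    """Sum of per-slice l-infinity norms (largest magnitude in each slice)."""
    return float(np.abs(slices).max(axis=1).sum())


# Three toy slices of four weights each (illustrative values only).
slices = np.array([
    [3.0, -4.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],
    [1.0, 1.0, 1.0, 1.0],
])

print(weight_l1(slices), group_l1(slices), group_l2(slices), group_linf(slices))
```

The $\ell_\infty$ group penalty only depends on the largest-magnitude entry of each slice, whereas the $\ell_2$ group penalty depends on every entry, which is one way to read the different sparsity and size behavior reported for the two.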

H INITIALIZATION OF CONTINUOUS SURROGATES

The initialization of the continuous surrogate $\widehat{W}$ of a latent space weight $\widetilde{W}$ and of the decoder matrix $\Psi$ plays an important role in neural network training. Naïve He initialization (He et al., 2015), commonly used in training ResNet classifiers, does not work in our case since small values of $\widehat{W}$ get rounded to zero before decoding. Such an initialization results in zero gradients for updating the parameters, and the loss stagnates. To overcome this issue, we propose a modification to the initialization of the different parameters. Recall that, in our framework, the decoded weights used in a forward pass are obtained as $W = \mathrm{reshape}(\widetilde{W}\Psi)$, where $\widetilde{W}$ is a matrix in $\mathbb{Z}^{C_{in} C_{out} \times l}$ and $\Psi$ is a matrix in $\mathbb{R}^{l \times l}$ (with $l = 1$ for dense weights and biases, and $l = K^2$ for convolutional weights).

Our goal is to initialize $\widehat{W}$ and $\Psi$ such that the decoded weights $W$ follow He initialization. First, since $\widehat{W}$ is rounded to the nearest integer (to obtain the latent space weights $\widetilde{W}$), we assume its elements are drawn from a uniform distribution in $[-b, b]$, where $b > 0.5$ in order to ensure at least some nonzero weights after rounding. Next, we take the elements of $\Psi$ to be drawn from a normal distribution with mean 0 and variance $v$. Assuming the parameters are i.i.d., and with $\mathrm{Var}(X)$ denoting the variance of any individual element of a matrix $X$,

$$\mathrm{Var}(W) = l \times \mathrm{Var}(\Psi) \times \mathrm{Var}(\widetilde{W}) \tag{10}$$

Assuming a ReLU activation, with $f$ denoting the total number of channels (fan-in or fan-out) for a layer, the LHS of Eq. (10) becomes $\frac{2}{f}$ under the He initializer. The RHS, on the other hand, can be obtained analytically:

$$\frac{2}{f} = l \times v \times \frac{(2b+1)^2 - 1}{12} \implies b = \frac{\sqrt{\frac{24}{lvf} + 1} - 1}{2}, \quad v = \frac{24}{lf\left((2b+1)^2 - 1\right)} \tag{11}$$

Eq. (11) gives us a relationship between $b$ (defining the uniform distribution of $\widehat{W}$) and $v$ (defining the normal distribution of $\Psi$). Note that the values of $l$ and $f$ are constant and known for each layer.

For a weight decoder corresponding to a parameter group, the maximum value of $f$ in that group yields the smallest value of $b$, which should be above a minimum limit $b_{min}$. Denoting by $f_{max}$ the maximum fan-in or fan-out value for a parameter group, we get

$$v = \frac{24}{l f_{max}\left((2b_{min}+1)^2 - 1\right)} \implies b = \frac{\sqrt{\frac{f_{max}}{f}\left((2b_{min}+1)^2 - 1\right) + 1} - 1}{2} \tag{12}$$

The hyperparameter $b_{min}$ then refers to the minimum boundary any latent space parameter can take in the network. We compute $v$ from $f_{max}$ and $b_{min}$, and compute $b$ for each parameter from its corresponding value of $f$. We then initialize the elements of $\widehat{W}$ by drawing them from a uniform distribution on the interval $[-b, b]$, and the elements of $\Psi$ by drawing them from $\mathcal{N}(0, v)$. Note that $f = f_{max} \implies b = b_{min}$, which shows that the minimum boundary corresponds to the layer with the maximum number of channels (fan-in or fan-out) $f$. By choosing an appropriate value of $b_{min}$, we obtain good initial values of the gradient, which allows the network to converge well as training progresses. $b_{min}$ offers an intuitive way of initializing the discrete weights: too small a value leads to most of the weights being set to zero, while too large a value can lead to exploding gradients. In practice, we find that this initialization approach works well for the CIFAR experiments. For the ImageNet experiments, we assume a normal distribution instead of a uniform distribution for $\widehat{W}$, with a sufficiently high variance for the network to train.
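The relationships in Eq. (11) and Eq. (12) can be turned into a short initialization routine. The sketch below is a hedged illustration under stated assumptions, not the released implementation: the function names, the `(c_in, c_out, f)` layer description, and the example layer sizes are all hypothetical, and it covers only the uniform-surrogate case described for the CIFAR experiments.

```python
import math

import numpy as np


def decoder_variance(l, f_max, b_min):
    """Variance v of the decoder entries, following Eq. (12)."""
    return 24.0 / (l * f_max * ((2.0 * b_min + 1.0) ** 2 - 1.0))


def latent_bound(f, f_max, b_min):
    """Uniform bound b for a layer with f channels, following Eq. (12)."""
    inner = (f_max / f) * ((2.0 * b_min + 1.0) ** 2 - 1.0) + 1.0
    return (math.sqrt(inner) - 1.0) / 2.0


def init_group(layers, l, b_min, rng):
    """Initialize one parameter group that shares a single decoder.

    layers: list of (c_in, c_out, f) tuples (hypothetical layer description).
    Returns the decoder matrix and one continuous surrogate per layer.
    """
    f_max = max(f for _, _, f in layers)
    v = decoder_variance(l, f_max, b_min)
    psi = rng.normal(0.0, math.sqrt(v), size=(l, l))
    surrogates = []
    for c_in, c_out, f in layers:
        b = latent_bound(f, f_max, b_min)
        surrogates.append(rng.uniform(-b, b, size=(c_in * c_out, l)))
    return psi, surrogates


rng = np.random.default_rng(0)
layers = [(16, 16, 16), (32, 64, 64)]  # illustrative sizes only
psi, surrogates = init_group(layers, l=9, b_min=1.0, rng=rng)
```

The layer with `f == f_max` receives the bound `b_min`, and layers with fewer channels receive a larger bound, so that every layer satisfies the variance condition in Eq. (11) with the shared decoder variance.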

I LICENSE

Figure 11: Comparison of $\ell_2$ vs. $\ell_1$ vs. $\ell_\infty$ norm for various metrics of sparsity and size for both CIFAR-10 (top row) and CIFAR-100 (bottom row). We see that the $\ell_2$ group norm does better than the $\ell_\infty$ group norm in terms of accuracy vs. model size or slice sparsity (a,b,e,f). The $\ell_1$ weight norm has little additional effect compared to the $\ell_2$ weight norm. $\ell_\infty$ favors higher slice sparsity for the same level of sparsity (c,g). $\ell_\infty$ tends to result in a higher model size for a given slice sparsity but also a higher slice sparsity for a given model size, which shows the tradeoff between compression and sparsification.



6 CONCLUSION

We propose a novel framework for training a deep neural network while simultaneously optimizing the model size, to reduce storage cost, and structured sparsity, to reduce computation cost. To the best of our knowledge, this is the first work on model compression that adds priors for structured pruning of weights in a quantized latent representation space.

Experiments on three datasets and three network architectures show that our approach achieves state-of-the-art performance in terms of simultaneous compression and reduction in computation, which directly translates to inference speedups. We also perform extensive ablation studies to verify that the proposed sparsity priors allow us to easily control the accuracy-compression-computation trade-off, which is an important consideration for the practical deployment of models.



$(x, y) \sim D, \quad -\log p(y \mid x; W)$

Figure 4: Effect of the structured sparsity coefficient $\lambda_S$. We show plots for 3 values of $\lambda_S$, varying $\lambda_U$ to obtain the Pareto curves. Panels (a) and (b) correspond to the accuracy-model size trade-off and the accuracy-slice sparsity trade-off respectively on CIFAR-10. The highest $\lambda_S$ improves the accuracy-sparsity trade-off curve, showing the importance of group lasso regularization for improving slice sparsity. Shaded areas represent the confidence intervals of the regression fit. Best viewed in color.

Fig. 4 shows model performance for three values of $\lambda_S$, while varying $\lambda_U$ to obtain the Pareto curves. Again, Fig. 4(a) and 4(b) show the impact on the top-1 accuracy as a function of model size and slice sparsity respectively. The highest value of $\lambda_S$ (green squares) shows a marked improvement of the accuracy-sparsity curve in Fig. 4(b). This shows that the group lasso regularization is effective in improving the slice sparsity of the model, which is necessary for computational benefits. Additionally, we see a small improvement in the accuracy-model size curve, as a higher $\lambda_S$ promotes a higher number of zero slices, which indirectly leads to lower entropy of the overall latent representations.

Figure 6: Speedups vs. Slice Sparsity (%). Speedup is the ratio of CPU throughput for the sparse models to the dense models. For both cases of ResNet-20 on CIFAR-10 (left) and ResNet-50 on ImageNet (right), we obtain practical inference speedups utilizing the sparsity of the compressed models.

While we show compression gains with respect to the SOTA in Sec. 5.1, here we highlight the computational gains we get through slice sparsity. We measure the speedups of the trained sparse models by obtaining the ratio of the throughput (images/second) of the sparse compressed models to that of the dense uncompressed models. Speedups are measured on a single core of an AMD EPYC 7302 16-Core Processor with a batch size of 16. We show inference results for two cases: ResNet-20 on CIFAR-10 and ResNet-50 on ImageNet. As a high slice sparsity provides the added benefits of full structured sparsity (when all slices in a filter/channel are zero), we show speedups for ResNet-20 by utilizing fully structured sparsity, removing entire filters or input channels of weight tensors which are all zeros. Niu et al. (2020) show that slice sparsity, along with pattern sparsity, provides inference speedups on mobile devices via compiler-assisted optimizations. However, due to the lack of an open-source code base, we utilize the DeepSparse engine (Kurtz et al., 2020), which also exploits this sparsity for CPU inference speedups. Results are shown in Fig. 6. For CIFAR-10 (left), even when only exploiting fully structured sparsity, we achieve nearly 2× speedups at 95% slice sparsity. Speedups scale almost exponentially for sparsity > 95%. For ImageNet (right), we obtain a 1.86× speedup compared to the dense uncompressed model at 81.7% slice sparsity, and even faster inference times for sparsity > 85%. Therefore, our framework offers practical inference speedups via slice sparsity with no hardware modifications, along with high levels of model compression.
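The quantities discussed above (slice sparsity, fully zero filters or input channels, and the throughput ratio) can be computed as in the sketch below. This is an illustration under assumptions: convolutional weights are taken to have shape `(C_out, C_in, K, K)`, and the function names and the toy tensor are hypothetical rather than part of the released code or the DeepSparse engine.

```python
import numpy as np


def slice_sparsity_stats(weight):
    """Summarize structured sparsity of a conv weight of shape (C_out, C_in, K, K)."""
    # A slice is one K x K kernel; it is "zero" only if every entry is zero.
    zero_slice = np.all(weight == 0, axis=(2, 3))  # shape (C_out, C_in)
    return {
        "slice_sparsity": float(zero_slice.mean()),
        # Filters whose slices are all zero can be dropped entirely.
        "removable_filters": int(zero_slice.all(axis=1).sum()),
        # Input channels that no filter reads can be dropped as well.
        "removable_input_channels": int(zero_slice.all(axis=0).sum()),
    }


def speedup(sparse_images_per_s, dense_images_per_s):
    """Speedup as the ratio of sparse-model to dense-model throughput."""
    return sparse_images_per_s / dense_images_per_s


# Toy weight tensor: 4 filters, 3 input channels, 3 x 3 kernels.
w = np.ones((4, 3, 3, 3))
w[0] = 0.0      # filter 0 is entirely zero
w[:, 1] = 0.0   # input channel 1 is zero in every filter
stats = slice_sparsity_stats(w)
print(stats)
```

In this toy tensor half of the slices are zero, yet only one filter and one input channel can be removed outright, which shows why fully structured sparsity is a stricter condition than slice sparsity.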

Figure 9: Scatter plots with horizontal and vertical error bars for ResNet-20-4 trained on CIFAR-10/100. For a different random seed, the model size changes, leading to the error bar along the x-axis, while the vertical bar represents the top-1 validation accuracy error on the y-axis. There is very little variance for CIFAR-10 and slightly higher variance for CIFAR-100 due to slow convergence, as shown in Fig. 10.

Sparse formats: Comparison of the effect of entropy coding vs. sparse matrix formats of CSR, COO on model compression of a ResNet-50 trained on ImageNet. We show the model size in MB of the latent weights along with the sparsity of the model weights.

Licenses of datasets.

lists all datasets we used and their licenses.

APPENDIX A ADDITIONAL BASELINE COMPARISON

We compare our approach with additional pruning and quantization approaches in Table 2 for ResNet-50 trained on ImageNet. We see that we continue to achieve high levels of model compression along with slice sparsity for inference speedups. Yuan et al. (2020) achieve high levels of sparsity, but the sparsity is unstructured and requires dedicated hardware to obtain speedups. A similar case holds for the quantization approaches of Zhao et al. (2019) and Jain et al. (2020), which can obtain inference speedups but only with hardware optimized for 4-bit and 8-bit integer arithmetic. Additionally, they typically require post-hoc training stages (Jain et al., 2020) to improve performance after quantization, while our approach is a single stage trained end-to-end.

Table 2: Comparison of our approach with other pruning and quantization approaches for ResNet-50 trained on ImageNet. We continue to achieve the most compression along with high slice sparsity. * denotes that the sparsity is unstructured and does not directly translate to computational benefits.

B HISTOGRAM OF WEIGHTS FOR DENSE UNCOMPRESSED MODEL

We obtain the histogram of weights of the various types of layers of a dense uncompressed ResNet-50 model trained on ImageNet with only the cross entropy loss. We do not apply any weight decay in order to avoid enforcing any distribution on the weights. Results are shown in Fig. 7. We show histograms for 1 × 1, 3 × 3, and 7 × 7 convolutions as well as for the dense layer. For 3 × 3 and 7 × 7 convolutions, we pick a random dimension from a 9-dimensional or a 49-dimensional slice respectively to plot the histograms, as a single probability model is fit to each dimension, as shown in Eq. (3). We see that the distributions are naturally unimodal and more or less zero-centered even without any weight decay regularization. The 7 × 7 convolution weight distribution is less continuous due to relatively fewer weight values per dimension (192), but it still weakly exhibits unimodality and symmetry. This shows that networks trained with a vanilla cross entropy loss naturally prefer such distributions. However, the probability models in Eq. (3) do not enforce any such distribution and can take on an arbitrary distribution. Thus, enforcing a Gaussian prior as proposed in Sec. 3.2 promotes unimodality and symmetry of the weight distributions, which can be beneficial for network performance.

C HISTOGRAM OF WEIGHTS FOR QUANTIZED LATENTS

To provide insights into the effect of our quantization, we also visualize the histogram of the quantized latents for different weight groups. Results are shown in Fig. 8. We see that we obtain a high proportion of zeros in almost all weight groups, spanning different types of convolutional layers as well as the final dense layer. Fewer zeros are present in the initial 7 × 7 convolution, similar to the uncompressed weights shown in Fig. 7, highlighting its importance in the network. Additionally, a large fraction of the elements in the 3 × 3 convolutions are zero, highlighting their redundancy and potential for compression compared to other convolutional layers or the dense layer.

• VGG-16 consists of a parameter group for each dense layer and a parameter group for all 3 × 3 convolutions, leading to four weight decoders/probability models for each parameter group.
• For ResNet-20-4, we use zero padding shortcut type A as defined in He et al. (2016), which leads to only 2 parameter groups, one for the final dense layer and the other for all 3 × 3 convolutions.
• For ResNet-18 trained on ImageNet, we use three parameter groups: the initial 7 × 7 convolution, the 3 × 3 convolutions, and the dense layer.
• ResNet-50 consists of an additional parameter group for 1 × 1 convolutions compared to ResNet-18.
• MobileNet-V2 consists of 3 parameter groups: the initial 3 × 3 convolution, the final dense layer, and the remaining 3 × 3 convolutions.

F STANDARD ERROR FOR MULTIPLE RUNS

Sec. 5 in the main paper shows results averaged across 3 seeds. In this section, we additionally provide the standard errors across the 3 random seeds. Results are summarized in Fig. 9 for the two datasets of CIFAR-10/100. CIFAR-10 shows little to no standard error along both the x-axis (model size) and the y-axis (top-1 validation accuracy). This suggests that the training is stable across different random seeds. For CIFAR-100, however, we observe a large error in the top-1 validation accuracy. We attribute this to the slow convergence for CIFAR-100, also highlighted in Fig. 10.
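For reference, the standard error across seeds can be computed as in the short sketch below. It assumes the usual definition (sample standard deviation divided by the square root of the number of runs); the function name and the accuracy values are hypothetical and are not results from this work.

```python
import math
import statistics


def mean_and_standard_error(values):
    """Return the mean and the standard error of the mean over repeated runs."""
    n = len(values)
    mean = statistics.fmean(values)
    # statistics.stdev is the sample standard deviation (n - 1 in the denominator).
    standard_error = statistics.stdev(values) / math.sqrt(n)
    return mean, standard_error


# Hypothetical top-1 accuracies (%) from 3 random seeds.
accuracies = [92.0, 92.5, 93.0]
mean, se = mean_and_standard_error(accuracies)
print(f"{mean:.2f} +/- {se:.2f}")
```

The same helper can be applied to the model sizes of the runs to obtain the horizontal error bars.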

CIFAR-100 Convergence:

We analyze the convergence of 3 different runs for ResNet-20-4 trained on the CIFAR-100 dataset with varying values of λ S and λ U . Results are shown in Fig. 10 when trained for 200 epochs. We see that validation accuracy (on the right y-axis) continues to increase towards the end of training between 190-200 epochs. At the same time, validation loss (on the left y-axis) also decreases. This suggests that the model hasn't fully converged by the end of 200 epochs. We hypothesize that this is an artifact of the dataset as well as the cosine decay schedule

