LILNETX: LIGHTWEIGHT NETWORKS WITH EXTREME MODEL COMPRESSION AND STRUCTURED SPARSIFICATION

Abstract

We introduce LilNetX, an end-to-end trainable technique for neural networks that enables learning models with a specified accuracy-compression-computation tradeoff. Prior works approach these problems one at a time and often require post-processing or multi-stage training. Our method, on the other hand, constructs a joint training objective that penalizes the self-information of network parameters in a latent representation space to encourage small model size, while also introducing priors to increase structured sparsity in the parameter space to reduce computation. When compared with existing state-of-the-art model compression methods, we achieve up to 50% smaller model size and 98% model sparsity on ResNet-20 on the CIFAR-10 dataset, as well as 31% smaller model size and 81% structured sparsity on ResNet-50 trained on ImageNet, while retaining the same accuracy as these methods. The resulting sparsity can improve inference time by a factor of almost 1.86× in comparison to a dense ResNet-50 model. Code is available at https://github.com/Sharath-girish/LilNetX.

1. INTRODUCTION

Figure 1: Our method jointly optimizes for size on disk and structured sparsity. We compare various approaches using the ResNet-50 architecture on ImageNet and plot reduction in FLOPs (y-axis) vs. reduction in model size (x-axis) for models with similar accuracy. Prior model compression methods optimize for either quantization (■) or pruning (▲) objectives. Our approach, LilNetX, enables training while optimizing for both compression (model size) and computation (structured sparsity). Refer to Table 1 for details.

Recent research in deep neural networks (DNNs) has shown that large performance gains can be achieved on a variety of real-world tasks simply by employing larger, parameter-heavy, and computationally intensive architectures (He et al., 2016; Dosovitskiy et al., 2020). However, as DNNs proliferate in industry, they often need to be trained repeatedly, transmitted over the network to different devices, and perform under hardware constraints with minimal loss in accuracy, all at the same time. Hence, finding ways to reduce the storage size of models on devices while simultaneously improving their run-time is of utmost importance. This paper proposes a general-purpose neural network training framework that jointly optimizes the model parameters for accuracy, model size on disk, and computation on any given task.

Over the last few years, research on training smaller and more efficient DNNs has followed two seemingly parallel tracks with different goals. One line of work focuses on model compression to deal with storage and communication network bottlenecks when deploying large models or a large number of small models. While these methods achieve high levels of compression in terms of memory, their focus is not on reducing computation.
These works either require additional algorithms with some form of post hoc training (Yeom et al., 2021) or quantize the network parameters at the cost of network performance (Courbariaux et al., 2015; Li et al., 2016). The other line of work focuses on reducing computation through various model pruning techniques (Han et al., 2015; Frankle & Carbin, 2018; Evci et al., 2020). Their focus is to decrease the number of floating point operations (FLOPs) of the network at inference time, while still achieving some compression due to fewer parameters. Typically, however, the cost of storing these pruned networks on disk is much higher than for dedicated model compression methods.

In this work, we bridge the gap between the two lines of work and show that it is indeed possible to train a neural network while jointly optimizing for both compression, to reduce disk space, and structured sparsity, to reduce computation (Fig. 1). We maintain quantized latent representations for the model weights and penalize the entropy of these latents. This idea of reparameterized quantization (Oktay et al., 2020) is extremely effective in reducing the effective model size on disk. However, it requires the full dense model during inference. To address this shortcoming, we introduce priors that encourage structured and unstructured sparsity in the representations, along with key design changes. Our priors reside in the latent representation space while encouraging sparsity in the model space. More specifically, we use the notion of slice sparsity, a form of structured sparsity where an entire K × K slice of a convolutional kernel with spatial size K and C channels is zero. Unlike unstructured sparsity, which incurs irregular memory access and offers little practical speedup, slice-structured sparsity allows entire kernel slices to be removed per filter, thus reducing the channel size for the convolution of each filter.
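To make the entropy penalty concrete, the following is a minimal sketch of scoring quantized latents by their total self-information. It is illustrative only: the function name and the rounding quantizer are our assumptions, and we use a non-differentiable empirical distribution in place of the learned, differentiable probability model that reparameterized quantization (Oktay et al., 2020) trains end-to-end.

```python
import numpy as np

def entropy_penalty(latents, step=1.0):
    """Total self-information (in bits) of rounded latents under their
    own empirical distribution. Minimizing such a code-length proxy
    pushes latents toward a low-entropy, cheap-to-code distribution,
    which translates into a smaller compressed model on disk."""
    q = np.round(np.asarray(latents, dtype=np.float64) / step).astype(np.int64)
    symbols, counts = np.unique(q, return_counts=True)
    probs = counts / counts.sum()
    prob_of = dict(zip(symbols.tolist(), probs.tolist()))
    # Code length under the empirical model: sum_i -log2 p(q_i)
    return float(sum(-np.log2(prob_of[s]) for s in q.ravel().tolist()))

# A latent tensor concentrated on a few symbols costs fewer bits overall...
peaked = np.zeros(1000)
peaked[:10] = 5.0
# ...than one spread uniformly over many distinct symbols.
spread = np.arange(1000, dtype=np.float64)
```

Here `peaked` incurs a much smaller penalty than `spread`, mirroring why pushing most latents to a single shared value (e.g. zero) shrinks the coded model size.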
Additionally, it is more fine-grained than fully structured channel/filter sparsity approaches (He et al., 2017; Mao et al., 2017), which typically lead to accuracy drops. Extensive experimentation on three standard datasets shows that our framework achieves high levels of structured sparsity in the trained models. The introduced priors also show gains in model compression compared to the previous state of the art. By varying the weights of the priors, we establish a trade-off between model size, sparsity, and accuracy. Along with model compression, we achieve inference speedups by exploiting the sparsity in the trained models. We dub our method LilNetX: Lightweight Networks with EXtreme Compression and Structured Sparsification. Our contributions are summarized below.

• We introduce LilNetX, an algorithm that jointly performs model compression and structured sparsification for direct computational gains during network inference. Our algorithm can be trained end-to-end using a single joint optimization objective, without any post hoc training or post-processing.

• With extensive ablation studies and results, we show the effectiveness of our approach, outperforming existing approaches in both model compression and pruning in most network and dataset setups, while obtaining inference speedups in comparison to dense baselines.
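The slice-sparsity pattern described above can be sketched in a few lines. Note this is a hypothetical post hoc thresholding illustration of the *pattern* only (the function name and threshold are ours); LilNetX itself induces these zero slices during training via priors in the latent space rather than by pruning afterwards.

```python
import numpy as np

def slice_sparsity_mask(weights, threshold=1e-3):
    """Zero out whole K x K slices of a conv weight tensor.

    `weights` has shape (out_channels, in_channels, K, K), the usual
    layout for a 2D convolution. A slice is the K x K kernel one filter
    applies to one input channel; if its L2 norm falls below `threshold`,
    the entire slice is dropped. Removing a slice shrinks the effective
    number of input channels that filter must read, which is what makes
    this pattern exploitable for real speedups, unlike scattered,
    unstructured zeros with irregular memory access."""
    norms = np.linalg.norm(weights.reshape(*weights.shape[:2], -1), axis=-1)
    keep = norms >= threshold                  # (out, in) boolean mask
    pruned = weights * keep[:, :, None, None]  # broadcast over the K x K slice
    slice_sparsity = 1.0 - keep.mean()         # fraction of dropped slices
    return pruned, slice_sparsity

# Demo: 4 filters, 3 input channels, 3x3 kernels; kill one slice and one filter.
w = np.ones((4, 3, 3, 3))
w[0, 1] = 0.0   # filter 0 no longer reads channel 1
w[2, :] = 0.0   # filter 2 drops entirely
```

With the demo weights, 4 of the 12 slices fall below the threshold, so filter 0 convolves over only two input channels and filter 2 can be skipped altogether.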

2. RELATED WORK

Typical model compression methods follow some form of quantization, parameter pruning, or both. Both lines of work focus on reducing the size of the model on disk and/or increasing the speed of the network at inference time, while maintaining an acceptable level of classification accuracy. In this section, we discuss prominent quantization and pruning techniques.

Model pruning: A plethora of works show that a large number of network weights can be pruned without significant loss in performance (LeCun et al., 1990; Reed, 1993; Han et al., 2015). Methods such as the Lottery Ticket Hypothesis (Frankle & Carbin, 2018), adapted by various works (Savarese et al., 2020; Frankle et al., 2019; Malach et al., 2020; Girish et al., 2021; Chen et al., 2021; 2020; Desai et al., 2019; Yu et al., 2020), prune models while reaching the dense network performance, but are iterative and perform unstructured pruning. Other works prune at initialization (Lee et al., 2018; Wang et al., 2020; Liu & Zenke, 2020; Tanaka et al., 2020) and avoid multiple iterations, but show accuracy drops compared to the dense models (Frankle et al., 2020). On the other hand, structured sparsity via filter/channel pruning offers practical speedups at the cost of accuracy (Wen et al., 2016; He et al., 2017; Huang & Wang, 2018). Yuan et al. (2020) obtain almost no drop in network performance with structured sparsification but achieve lower model compression rates due to the storage of floating-point weights. Other works operate at intermediate levels of structure, such as N:M structured sparsity (Zhou et al., 2021) and block sparsity (Narang et al., 2017). Niu et al. (2020) is the closest to ours in terms of pruning structure, utilizing slice sparsity along with an even finer pattern pruning. They show that such structure can be exploited for inference speedups. However, they require predefining a filter pattern set and heuristics for determining layerwise sparsity. They also optimize for auxiliary variables and incur additional training costs due to a dual optimization subproblem (Ren et al., 2019). In contrast, our algorithm uses a single objective to jointly optimize for sparsity and model compression, with very little impact on training complexity.
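As a rough sketch of what a single joint objective of this kind looks like, the following combines a task loss with a code-length term and a group-sparsity prior over K × K slices. The function name, the rounding quantizer, the group-lasso choice of prior, and the weightings are all our illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def joint_objective(task_loss, latents, lambda_info=1e-4, lambda_slice=1e-4):
    """Illustrative single objective: task loss + a self-information
    (code-length) term encouraging a small coded model, + a group-lasso
    prior over K x K slices encouraging structured (slice) sparsity.
    `latents` has shape (out_channels, in_channels, K, K)."""
    q = np.round(np.asarray(latents, dtype=np.float64))
    _, counts = np.unique(q, return_counts=True)
    probs = counts / counts.sum()
    bits = float(-(counts * np.log2(probs)).sum())  # empirical code length
    # Group lasso: sum of L2 norms, one group per K x K slice.
    slice_norms = np.linalg.norm(latents.reshape(*latents.shape[:2], -1), axis=-1)
    group_penalty = float(slice_norms.sum())
    return task_loss + lambda_info * bits + lambda_slice * group_penalty
```

Because both penalties act on the same latent tensor, a single gradient-based optimizer can trade off accuracy, coded size, and slice sparsity by tuning the two weights, which is the spirit of the single-objective training contrasted with dual-subproblem approaches above.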

