ABS: AUTOMATIC BIT SHARING FOR MODEL COMPRESSION

Abstract

We present Automatic Bit Sharing (ABS) to automatically search for optimal model compression configurations (e.g., pruning ratios and bitwidths). Unlike previous works that consider model pruning and quantization separately, we seek to optimize them jointly. To deal with the resultant large design space, we propose a novel single-path super-bit model that encodes all candidate compression configurations, rather than maintaining separate paths for each configuration. Specifically, we first propose a novel decomposition of quantization that encapsulates all the candidate bitwidths in the search space. Starting from a low bitwidth, we sequentially consider higher bitwidths by recursively adding reassignment offsets. We then introduce learnable binary gates to encode the choice of bitwidth, including a filter-wise 0-bit option for pruning. By jointly training the binary gates in conjunction with the network parameters, the compression configuration of each layer can be determined automatically. Our ABS brings two benefits for model compression: 1) it avoids the combinatorially large design space, reducing the number of trainable parameters and the search cost; 2) it also averts directly fitting an extremely low-bit quantizer to the data, greatly reducing the optimization difficulty caused by non-differentiable quantization. Experiments on CIFAR-100 and ImageNet show that our method achieves significant computational cost reductions while preserving promising performance.
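The recursive decomposition and gating described above can be sketched as follows. This is a minimal NumPy illustration under our own assumptions: `uniform_quantize` stands in for a generic symmetric uniform quantizer, and the candidate bitwidths (2, 4, 8) and the nested-gate semantics are hypothetical simplifications, not the paper's exact formulation.

```python
import numpy as np

def uniform_quantize(x, bits, x_max=1.0):
    """Map x in [-x_max, x_max] onto 2**bits - 1 uniform levels."""
    levels = 2 ** bits - 1
    scale = levels / (2.0 * x_max)
    return np.round((x + x_max) * scale) / scale - x_max

def super_bit_quantize(x, gates, bitwidths=(2, 4, 8)):
    """Single-path quantizer over shared weights with nested binary gates.

    gates[0] = 0 encodes the filter-wise 0-bit choice (pruning).
    Each later gate adds the "reassignment offset" that refines the
    previous (lower) bitwidth to the next higher one.
    """
    if gates[0] == 0:                      # pruned: output is exactly zero
        return np.zeros_like(x)
    out = uniform_quantize(x, bitwidths[0])
    keep = 1.0
    for g, lo, hi in zip(gates[1:], bitwidths[:-1], bitwidths[1:]):
        keep *= g                          # once a gate closes, all finer
        offset = uniform_quantize(x, hi) - uniform_quantize(x, lo)
        out = out + keep * offset          # offsets stay switched off
    return out
```

With all gates open, the offsets telescope to the highest-precision quantizer (here, `super_bit_quantize(x, [1, 1, 1])` equals `uniform_quantize(x, 8)`), while closing the second gate recovers the 2-bit quantizer. In the actual method, the gates are relaxed and trained jointly with the network weights rather than set by hand.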

1. INTRODUCTION

Deep neural networks (DNNs) have achieved great success in many challenging computer vision tasks, including image classification (Krizhevsky et al., 2012; He et al., 2016) and object detection (Lin et al., 2017a;b). However, a deep model usually has a large number of parameters and consumes huge amounts of computational resources, which remains a great obstacle for many applications, especially on resource-limited devices such as smartphones. To reduce the number of parameters and the computational overhead, many methods (He et al., 2019; Zhou et al., 2016) have been proposed to compress models by removing redundancy while maintaining performance. In the last decade, we have witnessed many model compression methods, such as network pruning (He et al., 2017; 2019) and quantization (Zhou et al., 2016; Hubara et al., 2016). Specifically, network pruning reduces the model size and computational cost by removing redundant modules, while network quantization maps full-precision values to low-precision ones. It has been shown that sequentially performing network pruning and quantization can yield a compressed network with a small model size and low computational overhead (Han et al., 2016). However, performing pruning and quantization in separate steps may lead to sub-optimal results. For example, the best quantization strategy for the uncompressed network is not necessarily the optimal one after network pruning. Therefore, we need to consider performing pruning and quantization simultaneously. Recently, many attempts have been made to automatically determine the compression configurations of each layer (i.e., pruning ratios and/or bitwidths), based on reinforcement learning (RL) (Wang et al., 2019), evolutionary search (ES) (Wang et al., 2020), Bayesian optimization (BO) (Tung & Mori, 2018), or differentiable methods (Wu et al., 2018; Dong & Yang, 2019).
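The sequential pipeline discussed above (prune first, then quantize) can be sketched in a few lines of NumPy. The magnitude-based filter-pruning criterion and the symmetric uniform quantizer below are our own simplifying assumptions for illustration, not the exact procedure of Han et al. (2016).

```python
import numpy as np

def prune_filters(weights, ratio):
    """Zero out the fraction `ratio` of filters with the smallest L1 norm.

    weights: array of shape (out_channels, ...), one filter per row.
    """
    norms = np.abs(weights).reshape(weights.shape[0], -1).sum(axis=1)
    num_pruned = int(ratio * weights.shape[0])
    pruned = weights.copy()
    pruned[np.argsort(norms)[:num_pruned]] = 0.0
    return pruned

def quantize(weights, bits):
    """Symmetric uniform quantization to signed `bits`-bit levels."""
    levels = 2 ** (bits - 1) - 1           # e.g. 7 levels per side at 4 bits
    scale = np.abs(weights).max() / levels
    return np.round(weights / scale) * scale

# Sequential compression: prune 50% of the filters, then quantize to 4 bits.
w = np.random.RandomState(0).randn(8, 3, 3, 3)
w_compressed = quantize(prune_filters(w, 0.5), 4)
```

The point made in the text is that a bitwidth tuned for the dense network need not remain optimal once half of the filters are removed, which motivates searching over both decisions jointly.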
In particular, previous differentiable methods formulate model compression as a differentiable search problem, exploring the search space with gradient-based optimization. As shown in Figure 1(a), each candi-

