BSQ: EXPLORING BIT-LEVEL SPARSITY FOR MIXED-PRECISION NEURAL NETWORK QUANTIZATION

Abstract

Mixed-precision quantization can potentially achieve the optimal tradeoff between performance and compression rate of deep neural networks, and thus has been widely investigated. However, a systematic method to determine the exact quantization scheme is still lacking. Previous methods either examine only a small manually-designed search space or utilize a cumbersome neural architecture search to explore the vast search space; neither approach can reach an optimal quantization scheme efficiently. This work proposes bit-level sparsity quantization (BSQ) to tackle mixed-precision quantization from the new angle of inducing bit-level sparsity. We consider each bit of the quantized weights as an independent trainable variable and introduce a differentiable bit-sparsity regularizer. BSQ can induce all-zero bits across a group of weight elements and realize dynamic precision reduction, leading to a mixed-precision quantization scheme for the original model. Our method enables the exploration of the full mixed-precision space with a single gradient-based optimization process, with only one hyperparameter to trade off performance and compression. BSQ achieves both higher accuracy and higher bit reduction than previous methods on various model architectures on the CIFAR-10 and ImageNet datasets.

1. INTRODUCTION

Numerous deep neural network (DNN) models have been designed to tackle real-world problems and have achieved beyond-human performance. However, DNN models commonly demand extremely high computation cost and large memory consumption, making deployment and real-time processing on embedded and edge devices difficult (Han et al., 2015b; Wen et al., 2016). To address this challenge, model compression techniques, such as pruning (Han et al., 2015b; Wen et al., 2016; Yang et al., 2020), factorization (Jaderberg et al., 2014; Zhang et al., 2015), and fixed-point quantization (Zhou et al., 2016; Wu et al., 2019; Dong et al., 2019), have been extensively studied. Among them, fixed-point quantization works directly on the data representation by converting weight parameters originally in 32-bit floating-point form to low-precision values in a fixed-point format. The quantized version of a DNN model requires much less memory for weight storage. Moreover, it can better utilize the fixed-point processing units in mobile and edge devices to run faster and more efficiently. Typically, model compression techniques aim to reduce a DNN model's size while maintaining its performance. The two optimization objectives in this tradeoff, however, have a contrary nature: the performance can be formulated as a differentiable loss function L(W) w.r.t. the model's weights W; yet the model size, typically measured by the number of non-zero parameters or operations, is a discrete function determined mainly by the model architecture. To co-optimize performance and model size, some previous pruning and factorization methods relax the representation of model size into a differentiable regularization term R(W). For example, group Lasso (Wen et al., 2016) and DeepHoyer (Yang et al., 2020) induce weight sparsity for pruning, while the attractive force regularizer (Wen et al., 2017) and the nuclear norm (Xu et al., 2018) are utilized to induce low rank. The combined objective L(W) + αR(W) can then be minimized directly with a gradient-based optimizer, optimizing performance and model size simultaneously. Here, the hyperparameter α controls the strength of the regularization and governs the performance-size tradeoff of the compressed model.
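To make this formulation concrete, the sketch below shows how such a regularized objective L(W) + αR(W) can be minimized with a standard gradient-based optimizer. It is a minimal illustration in PyTorch, using an elementwise L1 penalty as a stand-in for the structured regularizers cited above; the `model`, the input batch, and the value of `alpha` are placeholders, not part of the method described in this paper.

```python
import torch
import torch.nn as nn

def sparsity_regularizer(model: nn.Module) -> torch.Tensor:
    # R(W): a simple elementwise L1 norm over all weight tensors,
    # standing in for structured penalties such as group Lasso or DeepHoyer.
    return sum(p.abs().sum()
               for name, p in model.named_parameters() if "weight" in name)

def train_step(model, x, y, optimizer, alpha=1e-4):
    # Combined objective L(W) + alpha * R(W): both terms are differentiable,
    # so a single backward pass co-optimizes task performance and model size.
    criterion = nn.CrossEntropyLoss()
    loss = criterion(model(x), y) + alpha * sparsity_regularizer(model)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

In this setting, increasing `alpha` pushes more weights toward zero (a smaller model) at the cost of task accuracy, which is exactly the performance-size tradeoff governed by α in the discussion above.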

