BSQ: EXPLORING BIT-LEVEL SPARSITY FOR MIXED-PRECISION NEURAL NETWORK QUANTIZATION

Abstract

Mixed-precision quantization can potentially achieve the optimal tradeoff between performance and compression rate of deep neural networks, and thus has been widely investigated. However, it lacks a systematic method to determine the exact quantization scheme. Previous methods either examine only a small manually-designed search space or utilize a cumbersome neural architecture search to explore the vast search space. These approaches cannot lead to an optimal quantization scheme efficiently. This work proposes bit-level sparsity quantization (BSQ) to tackle mixed-precision quantization from the new angle of inducing bit-level sparsity. We consider each bit of quantized weights as an independent trainable variable and introduce a differentiable bit-sparsity regularizer. BSQ can induce all-zero bits across a group of weight elements and realize dynamic precision reduction, leading to a mixed-precision quantization scheme for the original model. Our method enables the exploration of the full mixed-precision space with a single gradient-based optimization process, using only one hyperparameter to trade off performance and compression. BSQ achieves both higher accuracy and higher bit reduction on various model architectures on the CIFAR-10 and ImageNet datasets compared to previous methods.

1. INTRODUCTION

Numerous deep neural network (DNN) models have been designed to tackle real-world problems, achieving beyond-human performance. However, DNN models commonly demand extremely high computation cost and large memory consumption, making deployment and real-time processing on embedded and edge devices difficult (Han et al., 2015b; Wen et al., 2016). To address this challenge, model compression techniques, such as pruning (Han et al., 2015b; Wen et al., 2016; Yang et al., 2020), factorization (Jaderberg et al., 2014; Zhang et al., 2015) and fixed-point quantization (Zhou et al., 2016; Wu et al., 2019; Dong et al., 2019), have been extensively studied. Among them, fixed-point quantization works directly on the data representation by converting weight parameters originally in 32-bit floating-point form to low-precision values in a fixed-point format. The quantized version of a DNN model requires much less memory for weight storage. Moreover, it can better utilize the fixed-point processing units in mobile and edge devices to run much faster and more efficiently.

Typically, model compression techniques aim to reduce a DNN model's size while maintaining its performance. The two optimization objectives in this tradeoff, however, have a contrary nature: the performance can be formulated as a differentiable loss function L(W) w.r.t. the model's weights W; yet the model size, typically measured by the number of non-zero parameters or operations, is a discrete function determined mainly by the model architecture. To co-optimize performance and model size, some previous pruning and factorization methods relax the representation of model size into a differentiable regularization term R(W). For example, group Lasso (Wen et al., 2016) and DeepHoyer (Yang et al., 2020) induce weight sparsity for pruning, while the attractive force regularizer (Wen et al., 2017) and the nuclear norm (Xu et al., 2018) are utilized to induce low rank. The combined objective L(W) + αR(W) can be directly minimized with a gradient-based optimizer, optimizing performance and model size simultaneously (a minimal code sketch is given below). Here, the hyperparameter α controls the strength of the regularization and governs the performance-size tradeoff of the compressed model.

Unlike for pruning and factorization, there lacks a well-defined differentiable regularization term that can effectively induce quantization schemes. Early works in quantization mitigate the tradeoff exploration complexity by applying the same precision to the entire model. This line of research focuses on improving the accuracy of ultra-low-precision DNN models, e.g., quantizing all the weights to 3 or fewer bits (Zhou et al., 2016; Zhang et al., 2018), or even to 1 bit (Rastegari et al., 2016). These models commonly incur significant accuracy loss, even after integrating emerging training techniques like the straight-through estimator (Bengio et al., 2013; Zhou et al., 2016), dynamic range scaling (Polino et al., 2018) and non-linear trainable quantizers (Zhang et al., 2018). As different layers of a DNN model exhibit different sensitivities to quantization, a mixed-precision quantization scheme would be ideal for the performance-size tradeoff (Dong et al., 2019). There have also been accelerator designs that support the efficient inference of mixed-precision DNN models (Sharma et al., 2018).
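As a concrete illustration of the combined objective L(W) + αR(W) above, here is a minimal PyTorch sketch; the function names, the layer-wise grouping, and the value of α are our own illustrative choices, not part of any cited method:

```python
import torch

def group_lasso(model):
    """Group Lasso regularizer R(W): sum of L2 norms over weight groups.

    Here each layer's weight tensor is treated as one group; finer
    groupings (e.g., per output channel) follow the same pattern.
    """
    reg = 0.0
    for w in model.parameters():
        if w.dim() > 1:  # skip biases and other 1-D parameters
            reg = reg + w.norm(p=2)
    return reg

def training_step(model, x, y, optimizer, alpha=1e-4):
    """One step on L(W) + alpha * R(W); alpha trades accuracy vs. sparsity."""
    optimizer.zero_grad()
    loss = torch.nn.functional.cross_entropy(model(x), y) \
           + alpha * group_lasso(model)
    loss.backward()
    optimizer.step()
    return loss.item()

# Example: one step on random data with a tiny model.
model = torch.nn.Sequential(torch.nn.Linear(16, 10))
opt = torch.optim.SGD(model.parameters(), lr=0.1)
x, y = torch.randn(8, 16), torch.randint(0, 10, (8,))
print(training_step(model, x, y, opt))
```

Because both terms are differentiable, any standard optimizer explores the performance-size tradeoff in a single training run, which is exactly the property BSQ seeks to recover for quantization.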
However, to achieve the optimal layer-wise precision configuration, one needs to exhaustively explore the aforementioned discrete search space, the size of which grows exponentially with the number of layers. Moreover, the dynamic change of each layer's precision cannot be formulated as a differentiable objective, which hinders the efficiency of the design space exploration. Prior studies (Wu et al., 2019; Wang et al., 2019) utilize neural architecture search (NAS), which suffers from an extremely high search cost due to the large space of mixed-precision quantization schemes. Recently, Dong et al. (2019) proposed to rank each layer based on the corresponding Hessian information and then determine the relative precision order of the layers based on their ranking. The method, however, still requires manually selecting the precision level for each layer.

Here, we propose to revisit the fixed-point quantization process from a new angle of bit-level sparsity: decreasing the precision of a fixed-point number can be taken as forcing one or a few bits, most likely the least significant bit (LSB), to be zero; and reducing the precision of a layer is equivalent to zeroing out a specific bit of all the weight parameters of the layer. In other words, precision reduction can be viewed as increasing the layer-wise bit-level sparsity (a minimal code sketch of this view is given at the end of Section 2). By considering the bits of fixed-point DNN parameters as continuous trainable variables during DNN training, we can utilize a sparsity-inducing regularizer to explore the bit-level sparsity with gradient-based optimization, dynamically reduce the layer precision, and arrive at a series of mixed-precision quantization schemes. More specifically, we propose the Bit-level Sparsity Quantization (BSQ) method with the following contributions:

• We propose a gradient-based training algorithm for bit-level quantized DNN models. The algorithm considers each bit of quantized weights as an independent trainable variable and enables gradient-based optimization with a straight-through estimator (STE).
• We propose a bit-level group Lasso regularizer to dynamically reduce the weight precision of every layer and therefore induce mixed-precision quantization schemes.
• BSQ uses only one hyperparameter, the strength of the regularizer, to trade off model performance and size, making the exploration more efficient.

This work exclusively focuses on layer-wise mixed-precision quantization, which is the granularity considered in most previous works. However, the flexibility of BSQ enables it to explore mixed-precision quantization at any granularity with the same cost regardless of the search space size.

2. RELATED WORKS ON DNN QUANTIZATION

Quantization techniques convert floating-point weight parameters to low-precision fixed-point representations. Directly quantizing a pre-trained model inevitably introduces significant accuracy loss, so much of the early research focuses on how to finetune quantized models in low-precision configurations. As the quantized weights adopt discrete values, conventional gradient-based methods designed for continuous spaces cannot be directly used for training quantized models. To mitigate this problem, algorithms like DoReFa-Net utilize a straight-through estimator (STE) to approximate quantized model training with trainable floating-point parameters (Zhou et al., 2016). As shown in Equation (1), a floating-point weight element w is kept throughout the entire training process.
Along the forward pass, the STE quantizes w to an n-bit fixed-point representation w_q, which is used to compute the model output and loss L. During the backward pass, the STE directly passes the gradient w.r.t. w_q onto w, which enables w to be updated with the standard gradient-based optimizer.

Forward: w_q = (1 / (2^n − 1)) Round[(2^n − 1) w];   Backward: ∂L/∂w = ∂L/∂w_q.   (1)
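As an illustration (not the authors' released code), a minimal PyTorch sketch of the STE in Equation (1), assuming the weights have already been mapped into [0, 1] as in DoReFa-style schemes:

```python
import torch

class STEQuantize(torch.autograd.Function):
    """Straight-through estimator for n-bit uniform quantization (Eq. 1).

    Forward:  w_q = Round((2^n - 1) * w) / (2^n - 1)
    Backward: dL/dw = dL/dw_q (gradient passes through unchanged).
    """

    @staticmethod
    def forward(ctx, w, n_bits):
        scale = 2 ** n_bits - 1
        return torch.round(scale * w) / scale

    @staticmethod
    def backward(ctx, grad_wq):
        # Identity gradient w.r.t. w; no gradient for n_bits.
        return grad_wq, None

# Usage: keep a floating-point weight w and quantize on the fly in the
# forward pass; the rounding is invisible to the backward pass.
w = torch.rand(4, requires_grad=True)
w_q = STEQuantize.apply(w, 3)   # 3-bit fixed-point representation
w_q.sum().backward()            # w.grad equals dL/dw_q elementwise
```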

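To make the bit-level sparsity view from Section 1 concrete, the following is a minimal sketch under our own naming, not BSQ's exact formulation: it decomposes non-negative n-bit integer weights into bit-planes and measures a group Lasso-style L2 norm per bit across a layer. A bit whose group norm is driven to zero can be dropped, lowering the layer's precision by one bit.

```python
import torch

def to_bits(w_int, n_bits):
    """Decompose non-negative integer weights into n binary bit-planes.

    Returns a tensor of shape (n_bits, *w_int.shape); plane 0 is the LSB.
    """
    bits = [(w_int // (2 ** b)) % 2 for b in range(n_bits)]
    return torch.stack(bits).float()

def bit_group_norms(w_int, n_bits):
    """L2 norm of each bit-plane across all weights of a layer.

    An all-zero bit-plane can be removed entirely, which reduces the
    layer's precision by one bit.
    """
    return to_bits(w_int, n_bits).flatten(1).norm(p=2, dim=1)

# Example: 4-bit weights whose values fit in 3 bits, so the top
# bit-plane is zero everywhere and the layer needs only 3 bits.
w_int = torch.randint(0, 8, (64, 32))
print(bit_group_norms(w_int, n_bits=4))  # last entry is 0.0
```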
