SEMI-RELAXED QUANTIZATION WITH DROPBITS: TRAINING LOW-BIT NEURAL NETWORKS VIA BIT-WISE REGULARIZATION

Abstract

Network quantization, which aims to reduce the bit-lengths of network weights and activations, has emerged as one of the key ingredients for shrinking neural networks so that they can be deployed on resource-limited devices. To overcome the difficulty of transforming continuous activations and weights into discrete ones, a recent study, Relaxed Quantization (RQ) (Louizos et al., 2019), successfully employs the popular Gumbel-Softmax, which allows this transformation with efficient gradient-based optimization. However, RQ with this Gumbel-Softmax relaxation still suffers from a large quantization error because it requires a high temperature to keep the gradient variance low, and hence shows suboptimal performance. To resolve this issue, we propose a novel method, Semi-Relaxed Quantization (SRQ), which uses a multi-class straight-through estimator to effectively reduce the quantization error, along with a new regularization technique, DropBits, which replaces dropout by randomly dropping bits instead of neurons to reduce the distribution bias of the multi-class straight-through estimator in SRQ. As a natural extension of DropBits, we further introduce a way of learning heterogeneous quantization levels to find the proper bit-length for each layer using DropBits. We experimentally validate our method on various benchmark datasets and network architectures, and also support a new hypothesis for quantization: learning heterogeneous quantization levels outperforms using the same but fixed quantization levels from scratch.

1. INTRODUCTION

Deep neural networks have achieved great success in various computer vision applications such as image classification, object detection/segmentation, pose estimation, and action recognition. However, state-of-the-art neural network architectures require too much computation and memory to be deployed on resource-limited devices. Therefore, researchers have been exploring various approaches to compress deep neural networks and reduce their memory usage and computation cost. In this paper, we focus on neural network quantization, which aims to reduce the bit-width of a neural network while maintaining competitive performance with a full-precision network. Quantization is typically divided into two groups: uniform and heterogeneous quantization. In uniform quantization, one of the simplest methods is to round the full-precision weights and activations to the nearest grid points: x̂ = α⌊x/α + 1/2⌋, where α controls the grid interval size. However, this naïve approach incurs severe performance degradation on large datasets.

Recent network quantization methods tackle this problem from different perspectives. In particular, Relaxed Quantization (RQ) (Louizos et al., 2019) employs Gumbel-Softmax (Jang et al., 2017; Maddison et al., 2017) to force weights and activations to be located near quantization grids with high density. Louizos et al. (2019) note the importance of keeping the gradient variance small, which leads them to use high Gumbel-Softmax temperatures in RQ. However, such high temperatures may cause a large quantization error, thus preventing quantized networks from achieving performance comparable to full-precision networks. To resolve this issue, we first propose Semi-Relaxed Quantization (SRQ), which uses the mode of the original categorical distribution in the forward pass, inducing a small quantization error. This is clearly distinguished from Gumbel-Softmax, which takes the argmax over samples from the concrete distribution.
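To make the round-to-nearest scheme above concrete, the following is a minimal NumPy sketch (the function name round_quantize and the sample values are our own illustration, not part of any cited method):

```python
import numpy as np

def round_quantize(x, alpha):
    """Round-to-nearest uniform quantization: x_hat = alpha * floor(x / alpha + 1/2).

    Every value snaps to the nearest multiple of the grid interval alpha.
    """
    return alpha * np.floor(x / alpha + 0.5)

x = np.array([-0.37, 0.12, 0.49])
q = round_quantize(x, alpha=0.25)  # snaps each value to the nearest multiple of 0.25
```

Note that the quantization error |x − x̂| is bounded by α/2, so a larger grid interval (i.e., fewer grid points, lower bit-width) directly increases the worst-case error.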
To cluster weights cohesively around quantization grid points, we devise a multi-class straight-through estimator (STE) that also allows for efficient gradient-based optimization. As this STE is biased like the traditional one for the binary case (Bengio et al., 2013), we present a novel technique, DropBits, to reduce the distribution bias of the multi-class STE in SRQ. Motivated by Dropout (Srivastava et al., 2014), DropBits drops bits rather than neurons/filters to train low-bit neural networks under the SRQ framework. In addition to uniform quantization, DropBits allows for heterogeneous quantization, which learns a different bit-width per parameter/channel/layer by dropping redundant bits. DropBits with learnable bit-drop rates adaptively finds the optimal bit-width for each group of parameters, possibly further reducing the overall bits. In contrast to recent studies in heterogeneous quantization (Wang et al., 2019; Uhlich et al., 2020), in which almost all layers are assigned at least 4 bits and up to 10 bits, our method yields much more resource-efficient low-bit neural networks with at most 4 bits for all layers. With trainable bit-widths, we also articulate a new hypothesis for quantization: one can find a learned bit-width network (termed a 'quantized sub-network') that performs better than a network trained from scratch with the same but fixed bit-widths. Our contribution is threefold:

• We propose a new quantization method, Semi-Relaxed Quantization (SRQ), which introduces the multi-class straight-through estimator to reduce the quantization error of Relaxed Quantization when transforming continuous activations and weights into discrete ones. We further present a novel technique, DropBits, to reduce the distribution bias of the multi-class straight-through estimator in SRQ.

• Extending the DropBits technique, we propose a more resource-efficient heterogeneous quantization algorithm to curtail redundant bit-widths across groups of weights and/or activations (e.g., across layers) and verify that our method is able to find 'quantized sub-networks'.

• We conduct extensive experiments on several benchmark datasets to demonstrate the effectiveness of our method. We accomplish new state-of-the-art results for ResNet-18 and MobileNetV2 on the ImageNet dataset when all layers are uniformly quantized.
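The forward pass of such a mode-based estimator can be sketched as follows. This is a simplified illustration under our own assumptions, not the exact SRQ construction: the squared-distance logits, the scale sigma, and all names are hypothetical. The forward pass snaps each value to the mode of a categorical distribution over grid points, while a straight-through backward pass would propagate gradients through the soft probabilities p instead of the non-differentiable argmax.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def mode_forward(x, grid, sigma=0.1):
    """Hypothetical sketch of a multi-class STE forward pass.

    Each scalar in x induces a categorical distribution over grid points,
    with nearer grid points receiving higher probability. The forward pass
    outputs the mode (hard assignment), keeping the quantization error small;
    a straight-through backward pass would use the gradient of `p` instead.
    """
    logits = -((x[..., None] - grid) ** 2) / sigma  # closer grid -> larger logit
    p = softmax(logits)                  # soft distribution (backward surrogate)
    hard = grid[np.argmax(p, axis=-1)]   # mode of the categorical (forward output)
    return hard, p

grid = np.linspace(-1.0, 1.0, 5)  # 5 grid points: -1, -0.5, 0, 0.5, 1
x = np.array([-0.9, 0.05, 0.6])
q, p = mode_forward(x, grid)      # q holds the nearest grid point for each x
```

In an autograd framework the straight-through trick is typically written as hard_output + soft_output − stop_gradient(soft_output), so the forward value is the hard assignment while gradients flow through the soft probabilities; this is where the bias that DropBits targets originates.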

2. RELATED WORK

While our goal in this work is to obtain an extremely low-bit neural network for both weights and activations, here we broadly discuss existing quantization techniques with various goals and settings. BinaryConnect (Courbariaux et al., 2015) first attempted to binarize weights to ±1 by employing a deterministic or stochastic operation. To obtain better performance, various studies (Rastegari et al., 2016; Li et al., 2016; Achterhold et al., 2018; Shayer et al., 2018) have been conducted on binarization and ternarization. To reduce hardware cost at inference, Geng et al. (2019) proposed a softmax approximation via a look-up table. Although these works effectively decrease the model size and raise accuracy, they are limited to quantizing weights while activations remain in full precision. To take full advantage of quantization at run-time, it is necessary to quantize activations as well. Researchers have recently focused more on simultaneously quantizing both weights and activations (Zhou et al., 2016; Yin et al., 2018; Choi et al., 2018; Zhang et al., 2018; Gong et al., 2019; Jung et al., 2019; Esser et al., 2020). XNOR-Net (Rastegari et al., 2016), the beginning of this line of work, exploits the efficiency of XNOR and bit-counting operations. QIL (Jung et al., 2019) also quantizes weights and activations by introducing parametrized learnable quantizers that can be trained jointly with the weight parameters. Esser et al. (2020) recently presented a simple technique to approximate the gradients with respect to the grid interval size to improve QIL. Nevertheless, these methods do not quantize the first or last layer, which leaves room to improve power-efficiency.

