SEMI-RELAXED QUANTIZATION WITH DROPBITS: TRAINING LOW-BIT NEURAL NETWORKS VIA BIT-WISE REGULARIZATION

Abstract

Network quantization, which aims to reduce the bit-lengths of network weights and activations, has emerged as one of the key ingredients for shrinking neural networks so that they can be deployed on resource-limited devices. To overcome the inherent difficulty of transforming continuous weights and activations into discrete ones, a recent study called Relaxed Quantization (RQ) (Louizos et al., 2019) successfully employs the popular Gumbel-Softmax trick, which allows this transformation to be trained with efficient gradient-based optimization. However, RQ with this Gumbel-Softmax relaxation still suffers from large quantization error, because a high temperature is needed to keep the gradient variance low, and hence shows suboptimal performance. To resolve this issue, we propose a novel method, Semi-Relaxed Quantization (SRQ), which uses a multi-class straight-through estimator to effectively reduce the quantization error, together with a new regularization technique, DropBits, which replaces dropout by randomly dropping bits instead of neurons in order to reduce the distribution bias of the multi-class straight-through estimator in SRQ. As a natural extension of DropBits, we further introduce a way of learning heterogeneous quantization levels, using DropBits to find the proper bit-length for each layer. We experimentally validate our method on various benchmark datasets and network architectures, and also support a new hypothesis for quantization: learning heterogeneous quantization levels outperforms training with the same, but fixed, quantization levels from scratch.

1. INTRODUCTION

Deep neural networks have achieved great success in various computer vision applications such as image classification, object detection/segmentation, pose estimation, and action recognition. However, state-of-the-art neural network architectures require too much computation and memory to be deployed on resource-limited devices. Therefore, researchers have been exploring various approaches to compress deep neural networks in order to reduce their memory usage and computation cost. In this paper, we focus on neural network quantization, which aims to reduce the bit-width of a neural network while maintaining performance competitive with the full-precision network. Quantization methods are typically divided into two groups: uniform and heterogeneous quantization. In uniform quantization, one of the simplest methods is to round the full-precision weights and activations to the nearest grid points: $\hat{x} = \alpha \left\lfloor \frac{x}{\alpha} + \frac{1}{2} \right\rfloor$, where $\alpha$ controls the grid interval size. However, this naïve approach incurs severe performance degradation on large datasets. Recent network quantization methods tackle this problem from different perspectives. In particular, Relaxed Quantization (RQ) (Louizos et al., 2019) employs Gumbel-Softmax (Jang et al., 2017; Maddison et al., 2017) to encourage weights and activations to be located near quantization grid points with high density. Louizos et al. (2019) note the importance of keeping the gradient variance small, which leads them to use high Gumbel-Softmax temperatures in RQ. However, such high temperatures may cause a large quantization error, thus preventing quantized networks from achieving performance comparable to full-precision networks.
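The round-to-nearest rule and the temperature trade-off above can be illustrated with a minimal sketch. The function names and the particular choice of logits are our own illustrative assumptions, not the exact RQ construction; the second function only shows the generic Gumbel-Softmax relaxation over a fixed grid, where a small temperature yields a nearly one-hot (hard) sample and a large temperature smears probability mass across grid points, enlarging the quantization error.

```python
import numpy as np

rng = np.random.default_rng(0)

def quantize(x, alpha):
    """Round-to-nearest uniform quantization: x_hat = alpha * floor(x/alpha + 1/2)."""
    return alpha * np.floor(x / alpha + 0.5)

def gumbel_softmax_quantize(x, grid, tau):
    """Illustrative Gumbel-Softmax relaxation over a fixed quantization grid
    (a sketch only, not RQ's exact noise model)."""
    logits = -np.abs(x - grid)                           # closer grid point -> higher logit
    g = -np.log(-np.log(rng.uniform(size=grid.shape)))   # Gumbel(0, 1) noise
    y = np.exp((logits + g) / tau)
    y /= y.sum()                                         # soft one-hot over grid points
    return y @ grid                                      # relaxed (soft) quantized value

w = np.array([0.12, -0.47, 0.31])
w_hat = quantize(w, alpha=0.25)      # grid points: ..., -0.25, 0.0, 0.25, ...
# Hard rounding keeps the error within alpha/2 of each value,
# while a large tau pulls the soft value toward the grid average.
```

With `alpha = 0.25`, the hard quantizer maps the weights to `[0.0, -0.5, 0.25]`, so the per-weight error never exceeds `alpha / 2 = 0.125`; the soft value returned for a high `tau` can drift much further from `x`, which is exactly the large-quantization-error issue the paper attributes to high temperatures.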

