BIT-PRUNING: A SPARSE MULTIPLICATION-LESS DOT-PRODUCT

Abstract

Dot-product is a central building block in neural networks. However, multiplication (mult) in the dot-product incurs substantial energy and area costs that challenge deployment on resource-constrained edge devices. In this study, we realize energy-efficient neural networks by exploiting a mult-less, sparse dot-product. We first reformulate a dot-product between an integer weight and activation into an equivalent operation comprised of additions followed by bit-shifts (add-shift-add). In this formulation, the number of add operations equals the number of set bits of the integer weight in binary format. Leveraging this observation, we propose Bit-Pruning, which removes unnecessary bits in each weight value during training to reduce the energy consumption of add-shift-add. Bit-Pruning can be seen as soft Weight-Pruning, as it prunes bits rather than whole weight elements. In extensive experiments, we demonstrate that sparse mult-less networks trained with Bit-Pruning show a better accuracy-energy trade-off than sparse mult networks trained with Weight-Pruning.

1. INTRODUCTION

Modern deep neural networks (DNNs) contain numerous dot-products between input features and weight matrices. However, it is well known that multiplication (mult) in the dot-product incurs substantial energy and area costs, challenging DNNs' deployment on resource-constrained edge devices. This has driven several attempts at efficient DNNs that reduce the energy of mult. Motivated by the significant energy reduction of sparse mult operations achieved by Weight-Pruning, and by recent advances in frameworks supporting unstructured sparsity, we envision new frontiers in the accuracy/energy trade-off by realizing a sparse, mult-less dot-product for hardware supporting unstructured sparsity. Noting that the dot-product between integer weights and activations can be decomposed into a power-of-two (PoT) basis and a binary vector, we first reformulate the dot-product between an integer weight and activation (mult-add) into an equivalent operation comprised of additions followed by bit-shifts and additions (add-shift-add, Figure 1). In this formulation, the number of first-stage add operations equals the bitcount of the weight elements in binary format. From this observation, we propose Bit-Pruning, which removes unnecessary add operations (and therefore unnecessary bits in the weights) during training in a data-driven manner to reduce the energy consumption of the add-shift-add operation. Because optimization in binary format is difficult, we optimize parameters on a high-precision, differentiable mult-add network with a bit-sparsity regularization that promotes weights to be sparse in binary format. After training, the obtained bit-sparse mult-add network is converted to an equivalent add-shift-add network comprised of sparse adds for efficient inference on DNN accelerators supporting unstructured sparsity. As depicted in Figure 1, Bit-Pruning can be seen as a soft and fine-grained Weight-Pruning; it does not necessarily remove whole weight elements but only removes some unnecessary bits of the weight values.
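A minimal numerical sketch of the mult-add/add-shift-add equivalence (toy unsigned weights for illustration; not the paper's implementation):

```python
import numpy as np

def add_shift_add(weights, acts, num_bits=4):
    """Dot-product via the add-shift-add decomposition:
    first accumulate activations per weight bit-plane (add),
    scale each partial sum by its power-of-two (shift),
    then sum the scaled partial sums (add)."""
    total = 0
    for b in range(num_bits):
        # first 'add' stage: sum activations whose weight has bit b set
        mask = (weights >> b) & 1
        partial = int(np.sum(acts[mask == 1]))
        # 'shift' stage: multiply the partial sum by 2^b via a bit-shift,
        # then the second 'add' stage accumulates the shifted partials
        total += partial << b
    return total

w = np.array([5, 3, 0, 6])   # unsigned 4-bit integer weights
x = np.array([2, 7, 1, 4])   # integer activations
assert add_shift_add(w, x) == int(np.dot(w, x))  # both give 55
```

Note that the first add stage performs one addition per set weight bit, which is exactly why pruning bits translates directly into fewer operations.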
This interpretation raises the following question: which pruning offers a better accuracy/energy trade-off, mult in the mult-add network (Weight-Pruning) or add in the add-shift-add network (Bit-Pruning)? We conducted an extensive evaluation to answer this question (Section 4); the results suggest that pruning the add is several times more energy efficient than pruning the mult. Remark: We do not argue that the add-shift-add representation itself is efficient. In fact, the estimated energy consumption of (dense) mult-add and add-shift-add without pruning is almost the same, as we see in Section 3.3. Our research interest is to investigate whether fine-grained pruning that removes bits rather than weights can find a more efficient sparse structure; the add-shift-add representation permits this investigation.
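The distinction between the two pruning granularities can be sketched as follows: the first-stage add count equals the total number of set bits across the weights, so dropping a single bit (a hypothetical toy example, not the paper's training procedure) already saves an add without zeroing the whole weight:

```python
def bitcount_cost(weights, num_bits=4):
    """First-stage adds in add-shift-add = total set bits over all weights."""
    return sum(bin(w & ((1 << num_bits) - 1)).count("1") for w in weights)

w = [5, 3, 0, 6]               # binary: 0101, 0011, 0000, 0110
dense_cost = bitcount_cost(w)  # 6 adds for the unpruned add-shift-add

# Weight-Pruning removes a whole element (all of its bits at once),
# whereas Bit-Pruning may drop a single bit, e.g. 5 (0101) -> 4 (0100),
# trading one add for a small perturbation of the weight value.
w_bitpruned = [4, 2, 0, 6]
pruned_cost = bitcount_cost(w_bitpruned)  # 4 adds, no weight fully removed
```

In this sense Bit-Pruning explores a strictly finer search space: zeroing a weight is the special case of pruning all of its bits.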

2.1. DOT-PRODUCT WITH MULT-ADD

The dot-product is a fundamental building block of neural networks. In this study, we mainly focus on the dot-product that appears in convolution¹. Let the weights and activation be quantized to M



¹ The same discussion applies to the dot-product in other computation blocks such as multi-layer perceptrons or self/cross-attention.
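For concreteness, a generic uniform quantizer mapping real weights or activations to unsigned integer codes might look as follows (a hypothetical scheme for illustration only; the paper's exact quantization procedure may differ):

```python
import numpy as np

def quantize_uniform(x, num_bits):
    """Map real values to unsigned num_bits-integer codes plus a scale,
    so that x is approximated by q * scale."""
    qmax = (1 << num_bits) - 1
    scale = float(x.max()) / qmax if x.max() > 0 else 1.0
    q = np.clip(np.round(x / scale), 0, qmax).astype(int)
    return q, scale

x = np.array([0.1, 0.9, 0.45])
q, s = quantize_uniform(x, num_bits=4)  # q holds 4-bit integer codes
# each entry of q * s is within half a quantization step of x
```

Once weights are integers, each weight is a sum of powers of two, which is precisely what the add-shift-add decomposition exploits.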



Figure 1: Dot-product realized using sparse mult-add (left) and its equivalent representation using add-shift-add (right). In mult-add, computation is reduced by pruning entire weights (mult). In contrast, in add-shift-add, computation is reduced by pruning bits (add) in binary format.

Table: Energy [pJ] and area [µm²] cost on ASIC.

That is, they limit model capacity and impose training at a precision for which gradient-based optimization is difficult, e.g., requiring approximate gradients via the straight-through estimator (STE) (Yin et al., 2019), making it challenging to achieve good accuracy in the low-bit

