WRAPNET: NEURAL NET INFERENCE WITH ULTRA-LOW-PRECISION ARITHMETIC

Abstract

Low-precision neural networks represent both weights and activations with few bits, drastically reducing the cost of multiplications. Meanwhile, these products are accumulated using high-precision (typically 32-bit) additions. Additions dominate the arithmetic complexity of inference in quantized (e.g., binary) nets, and high precision is needed to avoid overflow. To further optimize inference, we propose WrapNet, an architecture that adapts neural networks to use low-precision (8-bit) additions while achieving classification accuracy comparable to their 32-bit counterparts. We achieve resilience to low-precision accumulation by inserting a cyclic activation layer that makes results invariant to overflow. We demonstrate the efficacy of our approach using both software and hardware platforms.

1. INTRODUCTION

Significant progress has been made in quantizing (or even binarizing) neural networks, and numerous methods have been proposed that reduce the precision of weights, activations, and even gradients while retaining high accuracy (Courbariaux et al., 2016; Hubara et al., 2016; Li et al., 2016; Lin et al., 2017; Rastegari et al., 2016; Zhu et al., 2016; Dong et al., 2017; Zhu et al., 2018; Choi et al., 2018a; Zhou et al., 2016; Li et al., 2017; Wang et al., 2019; Jung et al., 2019; Choi et al., 2018b; Gong et al., 2019). Such quantization strategies make neural networks more hardware-friendly by leveraging fast, integer-only arithmetic, replacing multiplications with simple bit-wise operations, and reducing memory requirements and bandwidth. Unfortunately, the gains from quantization are limited because quantized networks still require high-precision arithmetic. Even if weights and activations are represented with just one bit, deep feature computation requires the summation of hundreds or even thousands of products. Performing these summations with low-precision registers results in integer overflow, contaminating downstream computations and destroying accuracy. Moreover, as multiplication costs are slashed by quantization, high-precision accumulation starts to dominate the arithmetic cost. Indeed, our own hardware implementations show that an 8-bit × 8-bit multiplier consumes comparable power and silicon area to a 32-bit accumulator. When the multiplier is reduced to 3-bit × 1-bit, the 32-bit accumulator consumes more than 10× the power and area of the multiplier; see Section 4.5. Evidently, low-precision accumulators are the key to further accelerating quantized nets. In custom hardware, low-precision accumulators reduce area and power requirements while boosting throughput.
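The overflow problem described above is easy to reproduce. The sketch below (a minimal, illustrative simulation, not part of the paper's implementation) accumulates a 3-bit × 1-bit inner product in a simulated 8-bit register: the wide accumulator gets the exact sum, while the 8-bit accumulator only retains the sum modulo 2⁸.

```python
import random

random.seed(0)
n = 512
w = [random.randint(-4, 3) for _ in range(n)]  # 3-bit signed weights in [-4, 3]
x = [random.randint(0, 1) for _ in range(n)]   # 1-bit activations in {0, 1}

# High-precision (wide) accumulation: the exact, overflow-free result.
z32 = sum(wi * xi for wi, xi in zip(w, x))

# Low-precision (8-bit) accumulation: keep only the low 8 bits after every add.
acc = 0
for wi, xi in zip(w, x):
    acc = (acc + wi * xi) & 0xFF               # 8-bit register wraps mod 256
acc_signed = acc - 256 if acc >= 128 else acc  # two's-complement interpretation

# The 8-bit result agrees with the exact sum only modulo 2^8.
assert (acc_signed - z32) % 256 == 0
```

The two results differ by a multiple of 256 whenever the running sum leaves the representable range, which is exactly the wrap-around behavior WrapNet is designed to tolerate.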
On general-purpose processors, where registers have fixed size, low-precision accumulators are exploited through bit-packing, i.e., by representing multiple low-precision integers side-by-side within a single high-precision register (Pedersoli et al., 2018; Rastegari et al., 2016; Bulat & Tzimiropoulos, 2019). Then, a single vector instruction is used to perform the same operation across all of the packed numbers. For example, a 64-bit register can be used to execute eight parallel 8-bit additions, thus increasing the throughput of software implementations. Hence, the use of low-precision accumulators is advantageous for both hardware and software implementations, provided that integer overflow does not contaminate results.

We propose WrapNet, a network architecture with extremely low-precision accumulators. WrapNet exploits the fact that integer computer arithmetic is cyclic, i.e., numbers are accumulated until they reach the maximum representable integer and then "wrap around" to the smallest representable integer. To deal with such integer overflows, we place a differentiable cyclic (periodic) activation function immediately after the convolution (or linear) operation, with period equal to the difference between the maximum and minimum representable integer. This strategy makes neural networks resilient to overflow, as the activations of neurons are unaffected by overflows during convolution.

We explore several directions with WrapNet. On the software side, we consider the use of bit-packing for processors with or without dedicated vector instructions. In the absence of vector instructions, overflows in one packed integer may produce a carry bit that contaminates its neighboring value. We propose training regularizers that minimize the effects of such contamination artifacts, resulting in networks that leverage bit-packed computation with very little impact on final accuracy.
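The overflow-invariance argument can be made concrete with a simplified integer version of the cyclic activation (the paper's actual layer is differentiable; the `wrap` and `cyclic_relu` functions below are illustrative names, not the authors' code). Because the activation has the same period as the accumulator, composing it with a wrapped (overflowed) sum yields the same output as composing it with the exact sum.

```python
def wrap(z, bits=8):
    """Two's-complement wrap: the value a signed `bits`-bit accumulator
    actually holds after (possibly) overflowing."""
    period = 1 << bits
    half = period >> 1
    return ((z + half) % period) - half

def cyclic_relu(z, bits=8):
    """A simplified cyclic activation with period 2^bits: a nonlinearity
    (here, ReLU) applied to the wrapped pre-activation."""
    return max(0, wrap(z, bits))

exact = 700                 # true pre-activation sum (out of 8-bit range)
overflowed = wrap(exact)    # what an 8-bit accumulator would return

# Invariance to overflow: both paths produce the same activation.
assert cyclic_relu(overflowed) == cyclic_relu(exact)
```

Since `wrap` is idempotent (`wrap(wrap(z)) == wrap(z)`), the invariance holds for any pre-activation value, which is precisely why downstream layers never observe overflow artifacts.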
For processors with vector instructions, we modify the Gemmlowp library (Jacob et al., 2016) to operate with 8-bit accumulators. Our implementation achieves up to 2.4× speed-up compared to a 32-bit accumulator implementation, even when lacking specialized instructions for 8-bit multiply-accumulate. We also demonstrate the efficacy of WrapNet in terms of cycle time, area, and energy efficiency when considering custom hardware designs in a commercial 28 nm CMOS technology.
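The bit-packing idea, and the carry contamination that arises without vector instructions, can both be seen in a small SWAR ("SIMD within a register") sketch. This is an assumed illustration in plain Python, not the Gemmlowp implementation: eight unsigned 8-bit lanes are packed into one 64-bit word and added with a single wide addition, and a lane that overflows leaks a carry bit into its neighbor.

```python
MASK64 = (1 << 64) - 1  # model a 64-bit register

def pack8(lanes):
    """Pack eight unsigned 8-bit values into one 64-bit integer."""
    v = 0
    for i, x in enumerate(lanes):
        v |= (x & 0xFF) << (8 * i)
    return v

def unpack8(v):
    """Split a 64-bit integer back into eight unsigned 8-bit lanes."""
    return [(v >> (8 * i)) & 0xFF for i in range(8)]

a = pack8([10, 200, 30, 40, 50, 60, 70, 80])
b = pack8([ 5, 100,  3,  4,  5,  6,  7,  8])

s = (a + b) & MASK64  # one 64-bit addition performs all eight lane additions

# Lane 1 overflows (200 + 100 = 300 > 255): it wraps to 44 and its carry
# bit contaminates lane 2 (30 + 3 becomes 34 instead of 33).
assert unpack8(s) == [15, 44, 34, 44, 55, 66, 77, 88]
```

Dedicated SIMD instructions add lanes independently, so no carry crosses lane boundaries; the training regularizers mentioned above target exactly this contamination on processors that lack such instructions.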

2. RELATED WORK AND BACKGROUND

2.1. NETWORK QUANTIZATION

Network quantization aims at accelerating inference by using low-precision arithmetic. In its most extreme form, weights and activations are both quantized using binary or ternary quantizers. The binary quantizer Q_b corresponds to the sign function, whereas the ternary quantizer Q_t additionally maps some values to zero. Multiplications in binarized or ternarized networks (Hubara et al., 2016; Courbariaux et al., 2015; Lin et al., 2017; Rastegari et al., 2016; Zhu et al., 2016) can be implemented using bit-wise logic, leading to impressive acceleration. However, training such networks is challenging since fewer than 2 bits are used to represent activations and weights, resulting in a dramatic loss in accuracy compared to full-precision models. Binary and ternary networks are generalized to higher precision via uniform quantization, which has been shown to result in efficient hardware (Jacob et al., 2018). The multi-bit uniform quantizer Q_u is given by Q_u(x) = round(x/∆_x)∆_x, where ∆_x denotes the quantization step-size. The output of the quantizer is a floating-point number x that can be expressed as x = ∆_x x_q, where x_q is the fixed-point representation of x. The fixed-point number x_q has a "precision" or "bitwidth," which is the number of bits used to represent it. Note that the range of floating-point numbers representable by the uniform quantizer Q_u depends on both the quantization step-size ∆_x and the quantization precision. Nonetheless, the number of different values that can be represented by the same quantizer depends only on the precision. Applying uniform quantization to both weights w = ∆_w w_q and activations x = ∆_x x_q simplifies computations, as an inner product becomes

z = Σ_i w_i x_i = Σ_i (∆_w (w_q)_i)(∆_x (x_q)_i) = (∆_w ∆_x) Σ_i (w_q)_i (x_q)_i = ∆_z z_q.    (1)

The key advantage of uniform quantization is that the core computation Σ_i (w_q)_i (x_q)_i can be carried out using fixed-point (i.e., integer) arithmetic only. Results in (Gong et al., 2019; Choi et al., 2018b; Jung et al., 2019; Wang et al., 2019; Mishra et al., 2017; Mishra & Marr, 2017) have shown that high classification accuracy is attainable with low-bitwidth uniform quantization, such as 2 or 3 bits. Although (w_q)_i, (x_q)_i, and their product may have extremely low precision, the accumulated result z_q of many of these products has a very high dynamic range. As a result, high-precision accumulators are typically required to avoid overflows, which is the bottleneck for further arithmetic speed-ups.
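The factorization in Eq. (1) is easy to verify numerically. The sketch below (an illustration with hypothetical step sizes, not the paper's training code) quantizes weights and activations uniformly, performs the inner product entirely on integer codes, and recovers the floating-point result with a single rescaling by ∆_z = ∆_w ∆_x.

```python
import numpy as np

def quantize(x, step, bits=8):
    """Uniform quantizer Q_u: return the fixed-point code x_q = round(x/step),
    clipped to the signed `bits`-bit range. The float value is step * x_q."""
    lo, hi = -(1 << (bits - 1)), (1 << (bits - 1)) - 1
    return np.clip(np.round(x / step), lo, hi).astype(np.int32)

dw, dx = 0.05, 0.1                  # hypothetical step sizes ∆_w and ∆_x
w = np.array([0.12, -0.31, 0.07])   # toy weights
x = np.array([0.90, 0.40, -0.20])   # toy activations

wq = quantize(w, dw)                # integer codes [2, -6, 1]
xq = quantize(x, dx)                # integer codes [9, 4, -2]

zq = int(np.dot(wq, xq))            # integer-only core computation -> -8
z = (dw * dx) * zq                  # one rescale: ∆_z = ∆_w ∆_x
```

Note that `zq` is computed from small integer codes yet already equals -8 from only three products; with hundreds of accumulated terms its dynamic range grows far beyond the input precision, which is why high-precision accumulators are normally required.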

2.2. LOW-PRECISION ACCUMULATION

Several approaches have been proposed that use accumulators with fewer bits to obtain speed-ups. For example, reference (Khudia et al., 2021) splits the weights into two separate matrices, one with

