TWINDNN: A TALE OF TWO DEEP NEURAL NETWORKS

Abstract

Compression technologies for deep neural networks (DNNs), such as weight quantization, have been widely investigated to reduce model size so that DNNs can be implemented on hardware with strict resource restrictions. However, one major downside of model compression is accuracy degradation. To deal with this problem effectively, we propose a new compressed-network inference scheme that pairs a high-accuracy but slower DNN with its highly compressed version, which typically delivers much faster inference at lower accuracy. During inference, we measure the confidence of the compressed DNN's prediction and fall back to the original DNN for inputs on which the compressed DNN is not confident. The proposed design delivers overall accuracy close to the high-accuracy model, but with latency closer to the compressed DNN. We demonstrate our design on two image classification tasks: CIFAR-10 and ImageNet. Our experiments show that our design can recover up to 94% of the accuracy drop caused by extreme network compression, with more than 90% higher throughput compared to using only the original DNN. This is the first work that considers using a highly compressed DNN along with the original DNN in parallel to improve latency significantly while effectively maintaining the original model accuracy.

1. INTRODUCTION

Machine learning is one of the most popular fields in the current era. It is used in various ways, such as speech recognition, face recognition, and medical diagnosis. However, the neural networks behind these applications Krizhevsky et al. (2012); He et al. (2016a;b) are becoming too large and slow to fit on a small chip in real-time systems. As a result, there has been a significant amount of research on reducing the size of neural networks so that their inference latencies are low enough to handle real-time inputs Zhang et al. (2020); Zhou et al. (2016); Zhang et al. (2018). There are quite a few approaches to compressing existing neural networks, but for field-programmable gate arrays (FPGAs), network quantization is the most popular and effective method to reduce size and inference latency at the same time Han et al. (2016); Blott et al. (2018). Extremely low bit-width (binary or ternary) networks provide an extra benefit that ordinary quantized networks do not, in terms of multiplier (DSP) utilization. The idea is that extremely low bit-width weights allow multiplications to be done in conditional logic, which can be implemented with logic gates, without dedicated multiplication hardware (DSPs). This frees up DSPs for other parts of the design where they can be useful. However, this benefit is not free, of course. One major downside of these low bit-width networks is that they tend to suffer an even larger accuracy drop than regular quantized neural networks, as a result of the further reduced precision. It is therefore difficult to use binary or ternary neural networks as they are, especially in fields such as surveillance or medical diagnosis, where the cost of that accuracy drop far outweighs the latency improvement.
Prior experiments show that these compression methods are very effective at reducing latency, at the cost of some accuracy: as one would expect, the fewer bits used to represent weights or feature maps, the more significantly accuracy drops. The goal of this study is to accelerate neural network inference by using extremely low bit-width network implementations on FPGAs, while maintaining the accuracy of the original network by running a relatively high-precision network concurrently, without having to develop a single DNN accelerator that meets both accuracy and latency requirements. The main contribution is a mechanism for choosing the right network to infer for each input: we build a hierarchical structure of two differently compressed networks and use the output of the initial inference to decide whether extra verification is needed. In summary, we propose a system that consists of two distinct networks: an extremely low bit-width network focused on latency, and a moderately quantized network focused on accuracy. In this paper, the extremely low bit-width network is called the compressed network, and the moderately quantized network is called the original network. These two networks work together in a way that exploits advantages in both latency and accuracy at the same time. The overall mechanism is similar to the one presented in Mocerino & Calimera (2014). However, there has not been any study of this concept in FPGA accelerators for deep neural networks. This represents a novel direction in neural network research: pairing a compressed network with the original network in order to improve latency while maintaining the original accuracy. Our main contributions are as follows:

• Accelerators that are designed and optimized to exploit low bit-width networks, with pipelined and parallelized computation engines that use as few DSPs as possible.

• A software solution that runs the two accelerators in a hierarchical fashion, utilizing the confidence of the compressed network's prediction.

• For ImageNet and ResNet-18, our TwinDNN solution delivers up to 1.9× latency improvement with only 3% extra DSPs used for the compressed network, while recovering up to 95% of the accuracy loss during hierarchical inference.

The rest of this paper is organized as follows. Section 2 introduces background related to this work. Section 3 explains the design flow of our implementation and experiments. Section 4 describes the experimental results. Section 5 concludes the paper with future explorations.
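The hierarchical inference mechanism described in this section can be sketched as follows. This is a minimal illustration in plain Python, not the paper's FPGA implementation: `compressed_net` and `original_net` stand in for the two accelerators (any callables mapping an input to class logits), and the confidence cutoff `threshold` is a hypothetical tunable parameter.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of raw scores."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def twin_inference(x, compressed_net, original_net, threshold=0.9):
    """Run the fast compressed network first; if its top softmax
    probability clears `threshold`, keep its prediction. Otherwise
    fall back to the slower but more accurate original network."""
    probs = softmax(compressed_net(x))
    if max(probs) >= threshold:
        return probs.index(max(probs))   # confident: take the fast path
    out = original_net(x)                # not confident: verify on original
    return out.index(max(out))
```

In practice the threshold trades latency for accuracy: a higher threshold routes more inputs to the original network, recovering more of the accuracy drop at the cost of throughput.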

2. BACKGROUND

EXTREMELY LOW BIT-WIDTH NEURAL NETWORKS

Recent research has succeeded in binarizing or ternarizing parts of the layers in neural networks Wang et al. (2018); Courbariaux & Bengio (2016); Courbariaux et al. (2015); Zhao et al. (2017); Yao Chen & Chen (2019). Because the goal of our study is to compensate for the accuracy loss caused by compression, we can tolerate a moderate accuracy loss from these networks, as long as the benefit of using them is significant. The following equations show how extremely low bit-width weights are used in computation:

$$w_b = \begin{cases} -w_{scale} & \text{if } b = 0 \\ +w_{scale} & \text{if } b = 1 \end{cases} \qquad w_t = \begin{cases} -w_{scale} & \text{if } t = -1 \\ 0 & \text{if } t = 0 \\ +w_{scale} & \text{if } t = 1 \end{cases}$$

Here, $b$ is a 1-bit value that can be either 0 or 1, and $t$ is a 2-bit value that can take -1, 0, or 1. The key idea is that the $w_{scale}$ value is shared across the weights; the bits are used only to represent the sign. In the binary case, for example, a single bit of 0 represents negative and 1 represents positive, and this logic can be implemented as a simple condition, i.e., a multiplexer, on FPGAs. The $w_{scale}$ value is stored separately, and the same $w_{scale}$ is multiplied over all binary weights to obtain the actual weight values. However, we do not need to perform all of these multiplications separately. Consider $b_1 = 0$ and $b_2 = 1$ for the binary case, and let $a$ denote an activation (feature map). Then we can express a very simple neural network computation as follows:

$$a_1 w_{b_1} + a_2 w_{b_2} = -a_1 w_{scale} + a_2 w_{scale} = w_{scale}(a_2 - a_1),$$

so the scale factor is applied once after the additions, and each per-weight multiplication reduces to a sign selection.
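To illustrate why the per-weight multiplications disappear, the following sketch (our own illustration; the names `binary_dot` and `full_dot` are not from the paper) computes a binarized dot product where each bit only selects a sign and the shared scale is applied once at the end — the conditional add/subtract is exactly what a multiplexer implements in hardware.

```python
# With binary weights, each 1-bit value b only selects the sign of the
# shared scale (0 -> -w_scale, 1 -> +w_scale), so a dot product needs no
# per-weight multiplier: just conditional add/subtract plus one scaling.
def binary_dot(acts, bits, w_scale):
    signed_sum = sum(a if b == 1 else -a for a, b in zip(acts, bits))
    return w_scale * signed_sum        # single multiplication overall

# Reference computation with the weights materialized explicitly.
def full_dot(acts, bits, w_scale):
    weights = [w_scale if b == 1 else -w_scale for b in bits]
    return sum(a * w for a, w in zip(acts, weights))
```

Both functions produce the same result, but `binary_dot` performs one multiplication regardless of the vector length, which is what lets the FPGA implementation avoid consuming a DSP per weight.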

