TWINDNN: A TALE OF TWO DEEP NEURAL NETWORKS

Abstract

Compression techniques for deep neural networks (DNNs), such as weight quantization, have been widely investigated to reduce DNN model size so that the models can be implemented on hardware with strict resource constraints. However, one major downside of model compression is accuracy degradation. To deal with this problem effectively, we propose a new compressed network inference scheme that couples a high-accuracy but slower DNN with a highly compressed version of it, which typically delivers much faster inference but lower accuracy. During inference, we determine the confidence of the compressed DNN's prediction and run the original DNN only on the inputs for which the compressed DNN is not confident. The proposed design can deliver overall accuracy close to that of the high-accuracy model, with latency closer to that of the compressed DNN. We demonstrate our design on two image classification tasks: CIFAR-10 and ImageNet. Our experiments show that our design can recover up to 94% of the accuracy drop caused by extreme network compression, with more than a 90% increase in throughput compared to using only the original DNN. This is the first work that considers using a highly compressed DNN along with the original DNN in parallel to improve latency significantly while effectively maintaining the original model accuracy.
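The confidence-gated scheme summarized above can be illustrated with a minimal sketch. It is not the paper's implementation: the confidence measure (maximum softmax probability), the threshold value of 0.9, and the names `twin_predict`, `compressed_model`, and `original_model` are assumptions made for illustration. The sketch also evaluates the two models sequentially, whereas the proposed design uses them in parallel.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def twin_predict(x, compressed_model, original_model, threshold=0.9):
    """Return (predicted_label, used_original).

    Both models are hypothetical callables mapping an input to a list of
    logits. The threshold is an assumed value, not one taken from the paper.
    """
    probs = softmax(compressed_model(x))
    confidence = max(probs)
    if confidence >= threshold:
        # The compressed DNN is confident: keep its fast prediction.
        return probs.index(confidence), False
    # Otherwise fall back to the slower, more accurate original DNN.
    logits = original_model(x)
    return logits.index(max(logits)), True


# Toy stand-ins for the two networks (illustrative only).
def toy_compressed(x):
    return [4.0, 0.0, 0.0] if x == "easy" else [0.4, 0.5, 0.3]


def toy_original(x):
    return [0.0, 0.0, 5.0]


easy_result = twin_predict("easy", toy_compressed, toy_original)
hard_result = twin_predict("hard", toy_compressed, toy_original)
```

In this toy setup the "easy" input is answered by the compressed model alone, while the "hard" input is handed to the original model. The fraction of inputs that take the fallback path is what governs the trade-off between overall accuracy and throughput.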

1. INTRODUCTION

Machine learning is one of the most popular fields in the current era. It is used in various ways, such as speech recognition, face recognition, and medical diagnosis. However, the neural networks used in machine learning applications Krizhevsky et al. (2012); He et al. (2016a; b) are becoming too large and slow to fit on a small chip for real-time systems. As a result, there has been a significant amount of research on reducing the size of neural networks so that their inference latencies are low enough to handle real-time inputs Zhang et al. (2020); Zhou et al. (2016); Zhang et al. (2018). There are quite a few approaches to compressing existing neural networks, but for field-programmable gate arrays (FPGAs), network quantization is the most popular and effective method to reduce size and inference latency at the same time Han et al. (2016). In particular, extremely low bit-width networks on FPGAs, such as binary or ternary neural networks, have been studied recently Wang et al. (2018); Courbariaux & Bengio (2016); Courbariaux et al. (2015); Zhao et al. (2017); Yao Chen & Chen (2019); Li & Liu (2016); Blott et al. (2018). These networks provide an extra benefit that normal quantized networks do not provide in terms of multiplier (DSP) utilization. Extremely low bit-width weights allow multiplications to be performed with conditional logic, which can be implemented with logic gates, without using dedicated multiplication hardware (DSPs). This frees the DSPs for other uses where they are beneficial. However, this benefit is not free. One major downside of these low bit-width networks is that they tend to suffer an even larger accuracy drop than regular quantized neural networks, as a result of further reduced precision. Therefore, it is more difficult to use binary or ternary neural networks as they are, especially in fields such as surveillance or medical diagnosis systems, where the cost of the accuracy drop outweighs the latency improvement.
The goal of this study is to accelerate neural network inference by using extremely low bit-width network implementations on FPGAs, while maintaining the accuracy of the original network by concurrently using a relatively high-precision network, without having to develop a single DNN accelerator that meets both accuracy and latency requirements. The main contribution is to find a

