TRIPLE-SEARCH: DIFFERENTIABLE JOINT-SEARCH OF NETWORKS, PRECISION, AND ACCELERATORS

Abstract

The record-breaking performance and prohibitive complexity of deep neural networks (DNNs) have ignited a substantial need for customized DNN accelerators, which have the potential to boost DNN acceleration efficiency by orders of magnitude. While it has been recognized that maximizing DNNs' acceleration efficiency requires a joint design/search over three different yet highly coupled aspects, namely the networks, the adopted precision, and their accelerators, the challenges associated with such a joint search have not yet been fully discussed and addressed. First, to jointly search for a network and its precision via differentiable search, there exists a dilemma: either the memory consumption explodes or the search settles for sub-optimal designs. Second, a generic and differentiable joint search of the networks and their accelerators is non-trivial due to (1) the discrete nature of the accelerator space and (2) the difficulty of obtaining operation-wise hardware cost penalties, because some accelerator parameters are determined by the whole network. To this end, we propose a Triple-Search (TRIPS) framework to address the aforementioned challenges towards jointly searching for the network structure, precision, and accelerator in a differentiable manner, so as to efficiently and effectively explore the huge joint search space. Our TRIPS addresses the first challenge via a heterogeneous sampling strategy that achieves unbiased search with constant memory consumption, and tackles the second using a novel co-search pipeline that integrates a generic differentiable accelerator search engine. Extensive experiments and ablation studies validate that both TRIPS-generated networks and accelerators consistently outperform state-of-the-art (SOTA) designs (including co-search/exploration techniques, hardware-aware NAS methods, and DNN accelerators) in terms of search time, task accuracy, and accelerator efficiency. All codes will be released upon acceptance.

1. INTRODUCTION

The powerful performance and prohibitive complexity of deep neural networks (DNNs) have fueled a tremendous demand for efficient DNN accelerators, which could boost DNN acceleration efficiency by orders of magnitude (Chen et al., 2016). In response, extensive research efforts have been devoted to developing DNN accelerators. Early works decouple the design of efficient DNN algorithms and their accelerators. On the algorithm level, pruning, quantization, or neural architecture search (NAS) is adopted to trim down the model complexity; on the hardware level, various FPGA-/ASIC-based accelerators have been developed to customize the micro-architectures (e.g., processing element array dimensions, memory sizes, and network-on-chip design) and algorithm-to-hardware mapping methods (e.g., loop tiling strategies and loop orders) in order to optimize the acceleration efficiency for a given DNN. Later, hardware-aware NAS (HA-NAS) was developed to further improve DNNs' acceleration efficiency for different applications (Tan et al., 2019). More recently, it has been recognized that (1) optimal DNN accelerators require a joint consideration/search over all of the following different yet coupled aspects, including DNNs' network structure, the adopted precision, and their accelerators' micro-architecture and mapping methods, and (2) exploring merely a subset of these aspects will lead to sub-optimal designs in terms of hardware efficiency or task accuracy. For example, the optimal accelerators for networks with different structures (e.g., width, depth, and kernel size) can be very different, while the optimal networks and their bitwidths for different accelerators can differ a lot (Wu et al., 2019). However, the direction of jointly designing or searching for all three aspects has only been slightly touched on previously.
For example, (Chen et al., 2018; Gong et al., 2019; Wang et al., 2020) proposed to jointly search for the structure and precision of DNNs for a fixed target hardware; (Abdelfattah et al., 2020; Yang et al., 2020; Jiang et al., 2020a; b) made the first attempts to jointly search for the networks and their accelerators, yet either their network or accelerator choices are limited due to the prohibitive time cost of their adopted reinforcement learning (RL) based methods; and EDD (Li et al., 2020) contributed a pioneering effort towards this direction by formulating a differentiable joint search framework, which, however, only considers a single accelerator parameter (i.e., the parallel factor) and, more importantly, has not yet fully solved the challenges of such a joint search. Although differentiable search is one of the most promising ways, in terms of search efficiency, to explore the huge joint search space as discussed in Sec. 4.2, a plethora of challenges exists in achieving an effective, generic joint search over the aforementioned three aspects. First, Challenge 1: to jointly search for a network and its precision via differentiable search, a dilemma arises over whether to activate all paths during the search. On one hand, if all paths are activated, the required memory consumption can easily explode and thus constrain the search's scalability to more complex tasks; on the other hand, partially activating a subset of the paths can lead to sequential training of different precisions on the same weights, which might result in inaccurate accuracy rankings among different precisions, as discussed in (Jin et al., 2020). Second, Challenge 2: the accelerators' parameters are not differentiable, and it is non-trivial to derive the operation-wise hardware-cost penalty needed to perform differentiable search (for the sake of search efficiency).
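The memory dilemma in Challenge 1 can be made concrete with a toy differentiable search over candidate bitwidths. A soft (fully relaxed) combination must evaluate every precision branch, so memory grows with the number of candidates, whereas hard single-path sampling evaluates only one branch at constant memory, at the risk of the ranking issues noted above. The sketch below is purely illustrative, not TRIPS' actual implementation; `fake_quant`, the candidate bitwidths, and the Gumbel-softmax temperature are all assumed names and values.

```python
import numpy as np

rng = np.random.default_rng(0)

def fake_quant(x, bits):
    """Uniform fake-quantization of x to the given bitwidth (toy version)."""
    levels = 2 ** bits - 1
    scale = np.abs(x).max() / levels
    return np.round(x / scale) * scale

def gumbel_softmax(logits, tau=1.0):
    """Differentiable relaxation of a categorical sample over the logits."""
    u = rng.uniform(1e-6, 1.0 - 1e-6, size=logits.shape)
    y = (logits - np.log(-np.log(u))) / tau
    y = np.exp(y - y.max())
    return y / y.sum()

bits_choices = [2, 4, 8]              # candidate precisions (illustrative)
alpha = np.zeros(len(bits_choices))   # trainable architecture logits
x = rng.standard_normal((4, 16))      # a toy activation tensor

# Soft relaxation: ALL branches are evaluated -> memory grows with #choices.
probs = gumbel_softmax(alpha)
soft_out = sum(p * fake_quant(x, b) for p, b in zip(probs, bits_choices))

# Hard (single-path) sampling: ONE branch is evaluated -> constant memory.
idx = int(np.argmax(gumbel_softmax(alpha)))
hard_out = fake_quant(x, bits_choices[idx])
```

Both outputs have the input's shape, but the soft path materializes three quantized copies of `x` per layer while the hard path materializes one, which is exactly the scalability trade-off described above.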
This is because the optimal accelerator is often determined by the whole network rather than by one specific operation/layer, since some accelerator parameters (e.g., the loop order) need to be optimized for the network as a whole. In this paper, we aim to address the aforementioned challenges towards a scalable, generic joint search for the network, precision, and accelerator. Specifically, we make the following contributions:

• We propose a Triple-Search (TRIPS) framework to jointly search for the network, precision, and accelerator in a differentiable manner, efficiently exploring the huge joint search space that previous RL-based methods cannot afford due to their prohibitive search cost. TRIPS identifies and tackles the aforementioned challenges towards a scalable, generic joint search over the three aspects for maximizing both accuracy and acceleration efficiency.

• We develop a heterogeneous sampling strategy for simultaneously updating the weights and network structures to (1) avoid the need to sequentially train different precisions and (2) achieve an unbiased search with constant memory consumption, i.e., solving Challenge 1 above. In addition, we develop a novel co-search pipeline that integrates a differentiable hardware search engine to address Challenge 2 above.

• Extensive experiments and ablation studies validate the effectiveness of our proposed TRIPS framework in terms of the resulting search time, task accuracy, and accelerator efficiency, when benchmarked against state-of-the-art (SOTA) co-search/exploration techniques, HA-NAS methods, and DNN accelerators. Furthermore, we visualize the accelerators searched by TRIPS to provide insights towards efficient DNN accelerator design in the Appendix.
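One common way to make a discrete accelerator space amenable to gradient descent, sketched below as a hedged illustration of the general idea rather than TRIPS' actual search engine, is to relax each discrete choice (here, a hypothetical PE-array size) into a softmax distribution over its options and penalize the expected hardware cost in the total loss. The cost table, the weight `lam`, and all variable names are made-up stand-ins; the finite-difference gradient substitutes for a real autograd framework.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# Hypothetical per-option hardware costs (e.g., latency in ms) for one
# discrete accelerator parameter, such as the PE-array size.
pe_costs = np.array([1.0, 0.6, 0.45, 0.4])

beta = np.zeros(4)   # trainable accelerator logits (one per option)
lam = 0.1            # weight of the hardware-cost term in the total loss

def total_loss(task_loss, beta):
    p = softmax(beta)
    expected_cost = p @ pe_costs   # differentiable in beta
    return task_loss + lam * expected_cost

# Finite-difference gradient w.r.t. beta (a stand-in for autograd).
eps = 1e-5
grad = np.array([
    (total_loss(0.3, beta + eps * np.eye(4)[i]) -
     total_loss(0.3, beta - eps * np.eye(4)[i])) / (2 * eps)
    for i in range(4)
])
beta = beta - 1.0 * grad   # one gradient step
```

After a single step, probability mass shifts toward the cheapest option, showing how a discrete choice can be optimized jointly with the network by gradient descent; what this toy omits is exactly Challenge 2's difficulty, namely that costs like `pe_costs` depend on the whole network rather than on any single layer.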

2. RELATED WORKS

Hardware-aware NAS. Hardware-aware NAS has been proposed to automate the design of efficient DNNs. Early works (Tan et al., 2019; Howard et al., 2019; Tan & Le, 2019) utilize RL-based NAS, which requires a massive search time/cost, while recent works (Wu et al., 2019; Wan et al., 2020; Cai et al., 2018; Stamoulis et al., 2019) explore the design space in a differentiable way (Liu et al., 2018) with much improved search efficiency. Along another direction, one-shot NAS methods (Cai et al., 2019; Guo et al., 2020; Yu et al., 2020) pretrain the supernet and directly evaluate the performances of the sub-networks in a weight-sharing manner as a proxy for their independently trained performances, at the cost of a longer pretraining time. In addition, NAS has been adopted to search for quantization strategies (Wang et al., 2019; Wu et al., 2018; Cai & Vasconcelos, 2020; Elthakeb et al., 2020) for trimming down the complexity of a given DNN. However, these works leave unexplored the

