NAHAS: NEURAL ARCHITECTURE AND HARDWARE ACCELERATOR SEARCH

Abstract

Neural architectures and hardware accelerators have been two driving forces for the rapid progress in deep learning. Previous works have optimized either neural architectures given fixed hardware, or hardware given fixed neural architectures, but none has considered optimizing the two jointly. In this paper, we study the importance of co-designing neural architectures and hardware accelerators. To this end, we propose NAHAS, an automated hardware design paradigm that jointly searches for the best configuration of both the neural architecture and the accelerator. In NAHAS, the accelerator hardware design is conditioned on the neural networks dynamically explored for the targeted application, rather than on fixed architectures, opening up better performance opportunities. Our experiments with an industry-standard edge accelerator show that NAHAS consistently outperforms both previous platform-aware neural architecture search and the state-of-the-art EfficientNet across all latency targets, improving ImageNet top-1 accuracy by 0.5%-1% while reducing latency by about 20%. Compared to independently optimizing the two subspaces, joint optimization halves the number of search samples and reduces latency-constraint violations from three to one per four searches.

1. INTRODUCTION

Conventional hardware design has been driven by benchmarks (e.g., SPEC (SPE)), where a selected set of workloads is evaluated and average performance is optimized. For example, CPUs are optimized for sequential workloads such as desktop applications, while GPUs are designed for massively parallel workloads such as gaming, graphics rendering, and scientific computing. Generalization over a wide set of representative workloads is the traditional method for hardware design and optimization. However, the selected workloads can stay fixed for a substantially long time, which makes hardware design lag behind algorithmic changes. With the end of Moore's Law in the recent decade, the focus has shifted to hardware specialization, which provides additional speedups and efficiency for a narrowed-down application or domain. Google's TPU (TPU) and Intel's Nervana NNP (Yang, 2019) are two representative accelerators specialized for deep learning primitives, and MLPerf (MLP) has become the prevalent benchmark for state-of-the-art ML accelerators. However, rapid progress in deep learning has produced numerous more powerful, expressive, and efficient models in a short time, leaving both benchmarking and accelerator development lagging behind. For example, squeeze-and-excite with global pooling and the SiLU/Swish non-linearity (Ramachandran et al., 2017; Elfwing et al., 2018) are found to be useful in EfficientNet (Tan & Le, 2019), yet neither currently executes efficiently even on a highly specialized accelerator. We need to evolve accelerator designs more rapidly.

On the other hand, platform-aware neural architecture search (Tan et al., 2019; Wu et al., 2019; Cai et al., 2018) optimizes neural architectures for a target inference device. The target device has a fixed hardware configuration, which can significantly limit NAS flexibility and performance.
For example, the target device may have a sub-optimal compute-to-memory ratio for the target application and its inference latency target, which can shift the optimal NAS model distribution and result in underperformance.
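The joint optimization idea can be illustrated as a single search loop over a combined architecture-accelerator space, with a reward that couples accuracy and a soft latency penalty (a common formulation in platform-aware NAS). The following is a minimal sketch, not NAHAS's actual implementation: the search spaces, the toy `evaluate` function, and all constants are hypothetical stand-ins for a real trainer and a cycle-accurate accelerator performance model.

```python
import random

# Hypothetical joint search space: neural-architecture knobs and
# accelerator knobs are sampled together, so the hardware config is
# conditioned on the explored network rather than held fixed.
ARCH_SPACE = {"depth": [1, 2, 3], "width": [16, 32, 64], "kernel": [3, 5]}
ACCEL_SPACE = {"pes": [32, 64, 128], "sram_kb": [256, 512, 1024]}

LATENCY_TARGET_MS = 5.0
BETA = -0.07  # soft-constraint exponent, as in platform-aware NAS rewards

def sample(space):
    return {k: random.choice(v) for k, v in space.items()}

def evaluate(arch, accel):
    """Toy stand-in for training plus accelerator simulation.

    Returns (accuracy, latency_ms); a real system would train the
    candidate network and measure it on a performance model of the
    candidate accelerator.
    """
    acc = 0.5 + 0.01 * arch["depth"] + 0.001 * arch["width"]
    lat = arch["depth"] * arch["width"] * arch["kernel"] / (accel["pes"] * 0.5)
    lat *= 1.0 if accel["sram_kb"] >= 512 else 1.3  # memory-bound penalty
    return acc, lat

def reward(acc, lat):
    # Accuracy scaled by a soft penalty when latency exceeds the target;
    # candidates under the target are not rewarded for being faster.
    return acc * min(1.0, (lat / LATENCY_TARGET_MS) ** BETA)

def joint_search(num_samples=200, seed=0):
    random.seed(seed)
    best = None
    for _ in range(num_samples):
        arch, accel = sample(ARCH_SPACE), sample(ACCEL_SPACE)
        acc, lat = evaluate(arch, accel)
        r = reward(acc, lat)
        if best is None or r > best[0]:
            best = (r, arch, accel, acc, lat)
    return best
```

Because the accelerator knobs are resampled alongside each candidate network, a compute-heavy architecture can be paired with more processing elements and a memory-bound one with more SRAM, which is exactly the flexibility a fixed target device lacks.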

