NAHAS: NEURAL ARCHITECTURE AND HARDWARE ACCELERATOR SEARCH

Abstract

Neural architectures and hardware accelerators have been two driving forces for the rapid progress in deep learning. Although previous works have optimized either neural architectures given fixed hardware, or hardware given fixed neural architectures, none has considered optimizing them jointly. In this paper, we study the importance of co-designing neural architectures and hardware accelerators. To this end, we propose NAHAS, an automated hardware design paradigm that jointly searches for the best configuration of both the neural architecture and the accelerator. In NAHAS, the accelerator hardware design is conditioned on the dynamically explored neural networks for the targeted application, instead of on fixed architectures, thus providing better performance opportunities. Our experiments with an industry-standard edge accelerator show that NAHAS consistently outperforms previous platform-aware neural architecture search and the state-of-the-art EfficientNet on all latency targets by 0.5%-1% ImageNet top-1 accuracy, while reducing latency by about 20%. Joint optimization reduces the number of search samples by 2x and reduces the latency constraint violations from 3 to 1 per 4 searches, compared to independently optimizing the two subspaces.

1. INTRODUCTION

Conventional hardware design has been driven by benchmarks (e.g., SPEC (SPE)), where a selected set of workloads is evaluated and the average performance is optimized. For example, CPUs are optimized for sequential workloads such as desktop applications, while GPUs are designed for massively parallel workloads such as gaming, graphics rendering, and scientific computing. Generalization over a wide set of representative workloads is the traditional method for hardware design and optimization. However, the selected workloads can stay fixed for a substantially long time, which makes the hardware design lag behind algorithmic changes. With the end of Moore's Law in the past decade, the focus has shifted to hardware specialization, which provides additional speedups and efficiency for a narrowed-down application or domain. Google's TPU (TPU) and Intel's Nervana NNP (Yang, 2019) are two representative accelerators specialized for deep learning primitives, and MLPerf (MLP) has become the prevalent benchmark for state-of-the-art ML accelerator designs. However, rapid progress in deep learning has produced numerous more powerful, expressive, and efficient models in a short time, leaving both benchmarking and accelerator development lagging behind. For example, squeeze-and-excitation with global pooling and the SiLU/Swish non-linearity (Ramachandran et al., 2017; Elfwing et al., 2018) are found to be useful in EfficientNet (Tan & Le, 2019), yet neither currently executes efficiently even on a highly specialized accelerator. Accelerator design therefore needs to evolve more rapidly. On the other hand, platform-aware neural architecture search (Tan et al., 2019; Wu et al., 2019; Cai et al., 2018) optimizes neural architectures for a target inference device. The target device has a fixed hardware configuration, which can significantly limit NAS flexibility and performance.
For example, the target device may have a sub-optimal compute-to-memory ratio for the target application and its inference latency target, which can shift the optimal NAS model distribution and result in underperformance.

We propose NAHAS, a new paradigm of software and hardware co-design that parameterizes neural architecture search and hardware accelerator search in a unified joint search space. We use a highly parameterized industry-standard ML accelerator as our target device, with a tunable set of important hardware parameters. These knobs fundamentally determine hardware characteristics such as the number of compute units, the amount of parallelism, the compute-to-memory ratio, and bandwidth, which we found to be critical to model performance. We formulate the optimization problem as a bi-level optimization with hardware resource constraints on chip area and model latency. Unlike conventional hardware optimization, NAHAS is a task-driven approach, where the task is a problem (e.g., image classification, object detection) or a domain of problems (e.g., vision, NLP), not a set of fixed programs or graphs (e.g., ResNet, Transformers). This effectively creates generalization across the vertical stack, making the hardware evolve with the applications. NAHAS can be used in practice to design customized accelerators for autonomous driving and mobile SoCs (systems-on-chip), where a set of highly optimized accelerators is combined into a system.

We also propose a latency-driven optimization that maximizes model accuracy while meeting a latency constraint under a chip area budget. Conventional platform-aware NAS typically focuses on searching for efficient NAS models with higher accuracy and lower parameter counts and FLOPs (number of multiply-add operations). However, optimizing for fewer parameters and FLOPs does not necessarily improve performance (Wu et al., 2019).
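The bi-level optimization with area and latency constraints described above can be sketched as follows. The notation here is ours for illustration, not the paper's: α denotes a neural architecture from the NAS space 𝒜, h an accelerator configuration from the hardware space ℋ, T the latency target, and A the chip area budget.

```latex
\begin{aligned}
\max_{\alpha \in \mathcal{A},\; h \in \mathcal{H}} \quad
  & \mathrm{Accuracy}\bigl(w^{*}(\alpha),\, \alpha\bigr) \\
\text{s.t.} \quad
  & w^{*}(\alpha) = \arg\min_{w}\; \mathcal{L}_{\mathrm{train}}(w, \alpha), \\
  & \mathrm{Latency}(\alpha, h) \le T, \qquad
    \mathrm{Area}(h) \le A,
\end{aligned}
```

The inner level trains the model weights w for a candidate architecture, while the outer level searches the joint (architecture, accelerator) space subject to the hardware resource constraints.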
For example, a multi-branch network like NASNet (Zoph et al., 2017) has fewer parameters and FLOPs than layer-wise networks such as ResNet and MobileNet, but its fragmented cell-level structure is not hardware friendly and can execute very slowly on the target device. In fact, parameter count and FLOPs are only indirect metrics; if ignored, they can negatively affect direct metrics such as latency and power. For example, the model can run out of memory if the number of parameters is too large. However, directly optimizing indirect metrics does not necessarily improve the direct metrics.

Figure 1 shows a motivating example of joint search and the high-level workflow of NAHAS. While conventional platform-aware NAS selects models along the Pareto frontier with different latency and accuracy trade-offs for one target device, as indicated in Figure 1a (left), NAHAS further expands the Pareto frontier by enabling different hardware accelerator configurations.

To summarize our contributions:

• We develop a fully automated framework that can jointly optimize neural architectures and hardware accelerators. For the first time, we demonstrate the effectiveness of co-optimizing neural architecture search with the parameterization of a highly optimized industry-standard accelerator.



Figure 1: Motivating example (a) and high-level workflow (b). In (a), different accelerator configurations have different Pareto frontiers consisting of different NAS models (left), and joint search effectively extends the Pareto frontier by joining multiple frontiers (right).

• We propose a latency-driven search method, which is hardware-agnostic and achieves state-of-the-art results across multiple search spaces. NAHAS outperforms MnasNet and EfficientNet on all latency targets by 0.5%-1% in ImageNet accuracy and 20% in latency.

• We observe that different model sizes combined with different latency targets require completely different hardware accelerator configurations. Customizing accelerators for different model sizes and latency targets becomes essential when co-designing neural architectures and accelerators for domains such as autonomous driving.
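The latency-driven joint search can be illustrated with a toy random-search sketch. Everything here is hypothetical: the search-space knobs, the analytic accuracy/latency model (stand-ins for model training and cycle-accurate simulation, which are the expensive parts of the real system), and the MnasNet-style soft latency penalty in the reward. It shows only the shape of the optimization loop, not the paper's actual controller or parameterization.

```python
import random

# Hypothetical joint search space: NAS knobs and accelerator knobs.
NAS_SPACE = {"kernel_size": [3, 5, 7], "width_mult": [0.75, 1.0, 1.25]}
HW_SPACE = {"pe_units": [64, 128, 256], "mem_kb": [512, 1024, 2048]}

LATENCY_TARGET_MS = 5.0
BETA = -0.07  # soft-penalty exponent (MnasNet-style reward shaping)


def sample(space, rng):
    """Draw one configuration uniformly at random from a search space."""
    return {k: rng.choice(v) for k, v in space.items()}


def evaluate(arch, hw):
    """Toy stand-in for the expensive inner loop: model training plus a
    latency simulation of the accelerator configuration."""
    acc = 0.70 + 0.03 * arch["width_mult"] + 0.004 * (arch["kernel_size"] - 3)
    work = arch["width_mult"] * arch["kernel_size"] ** 2   # pseudo-FLOPs
    latency_ms = 2.0 * work / (hw["pe_units"] / 64)        # more PEs -> faster
    return acc, latency_ms


def reward(acc, latency_ms):
    # Accuracy scaled by a soft latency penalty: configurations over the
    # target are discounted, configurations under it are mildly boosted.
    return acc * (latency_ms / LATENCY_TARGET_MS) ** BETA


def joint_search(num_samples=200, seed=0):
    """Random search over the joint (architecture, accelerator) space,
    keeping the configuration with the best latency-penalized reward."""
    rng = random.Random(seed)
    best = None
    for _ in range(num_samples):
        arch, hw = sample(NAS_SPACE, rng), sample(HW_SPACE, rng)
        acc, lat = evaluate(arch, hw)
        r = reward(acc, lat)
        if best is None or r > best[0]:
            best = (r, arch, hw, acc, lat)
    return best


best_reward, best_arch, best_hw, best_acc, best_lat = joint_search()
```

In the real system the random sampler would be replaced by a learned controller and the toy `evaluate` by training plus hardware simulation, but the joint sampling of both sub-spaces in a single loop is the point being illustrated: the accelerator configuration is re-chosen with every candidate architecture rather than held fixed.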

