PARETO RANK-PRESERVING SUPERNETWORK FOR HW-NAS

Abstract

In neural architecture search (NAS), training every sampled architecture is very time-consuming and should be avoided. Weight sharing is a promising solution to speed up the evaluation process. However, a sampled subnetwork is not guaranteed to be estimated precisely unless a complete individual training process is done. Additionally, practical deep learning engineering requires incorporating realistic hardware-performance metrics into the NAS evaluation process, known as hardware-aware NAS (HW-NAS). HW-NAS results in a Pareto front, the set of all architectures that optimally trade off the conflicting objectives, i.e., task-specific performance and hardware efficiency. This paper proposes a supernetwork training methodology that preserves the Pareto ranking between its different subnetworks, resulting in more efficient and accurate neural networks for a variety of hardware platforms. The results show a 97% near-Pareto-front approximation in less than 2 GPU days of search, a 2x speedup over state-of-the-art methods. We validate our methodology on NAS-Bench-201, DARTS, and ImageNet. Our optimal model achieves 77.2% accuracy (+1.7% over the baseline) with an inference time of 3.68 ms on an Edge GPU for ImageNet.

1. INTRODUCTION

A key element in solving real-world deep learning (DL) problems is the optimal selection of the sequence of operations and their hyperparameters, called the DL architecture. Neural architecture search (NAS) (Santra et al. (2021)) automates the design of DL architectures by searching for the best architecture within a set of possible architectures, called the search space. When hardware constraints are considered, hardware-aware neural architecture search (HW-NAS) (Benmeziane et al. (2021); Sekanina (2021)) simultaneously optimizes the task-specific performance, such as accuracy, and the hardware efficiency, measured by latency, energy consumption, memory occupancy, and chip area. HW-NAS works (Cai et al. (2019); Lin et al. (2021); Wang et al. (2022)) have demonstrated their usefulness by discovering state-of-the-art architectures for image classification (Lin et al. (2021)), object detection (Chen et al. (2019)), and keyword spotting (Busia et al. (2022)). HW-NAS is cast as a multi-objective optimization problem. Techniques for HW-NAS span evolutionary search, Bayesian optimization, reinforcement learning, and gradient-based methods. All of these require evaluating each sampled architecture on the targeted task and hardware platform. However, this evaluation is extremely time-consuming, especially for the task-specific performance, which requires training the architecture. Many estimation strategies (White et al. (2021)) are used to alleviate this problem, such as neural predictors (Benmeziane et al. (2022a); Ning et al. (2020)), zero-cost proxies (Lopes et al. (2021); Abdelfattah et al. (2021)), and weight sharing (Chu et al. (2021); Chen et al. (2021)).

These strategies are evaluated on how well they respect the ground-truth ranking between the architectures in the search space. Weight sharing is an estimation strategy that formulates the search space as a supernetwork, an over-parameterized architecture in which every candidate operation of each layer is trained and any path can be sampled; sampling a path yields a subnetwork of the supernetwork. With this definition, weight-sharing NAS falls into two categories: (1) two-stage NAS, in which the supernetwork is first trained on the targeted task, and the performance of each sampled subnetwork is then estimated from the pre-trained weights using a search strategy such as an evolutionary algorithm; and (2) one-stage NAS, in which we simultaneously search and train the supernetwork.
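The weight-sharing formulation can be pictured with a minimal toy sketch: each layer of the supernetwork holds every candidate operation, and sampling one operation per layer yields a subnetwork. The search space and operation names below are illustrative assumptions, not the paper's actual implementation.

```python
import random

# Toy search space (illustrative only): each supernetwork layer
# holds all candidate operations; all of them share trained weights.
SEARCH_SPACE = {
    "layer1": ["conv3x3", "conv5x5", "skip_connect"],
    "layer2": ["conv3x3", "avg_pool", "skip_connect"],
    "layer3": ["conv5x5", "max_pool", "skip_connect"],
}

def sample_subnetwork(space, rng):
    """Sample a path through the supernetwork: one operation per layer."""
    return {layer: rng.choice(ops) for layer, ops in space.items()}

# One sampled subnetwork; in two-stage NAS its performance would be
# estimated with the pre-trained supernetwork weights, not retrained.
arch = sample_subnetwork(SEARCH_SPACE, random.Random(0))
print(arch)
```

In a two-stage setting, a search strategy (e.g., an evolutionary algorithm) would repeatedly call such a sampler and keep the subnetworks ranked best by the shared-weight estimate.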
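The Pareto front over the two conflicting objectives can likewise be sketched concretely: an architecture is on the front if no other architecture is at least as good on both objectives and strictly better on one. The accuracy/latency pairs and the helper name `pareto_front` below are hypothetical, for illustration only.

```python
def pareto_front(archs):
    """Return the non-dominated subset of (accuracy, latency) pairs.

    Accuracy is maximized and latency (ms) is minimized. A pair is
    dominated if another pair is at least as good on both objectives
    and strictly better on at least one.
    """
    front = []
    for i, (acc_i, lat_i) in enumerate(archs):
        dominated = any(
            acc_j >= acc_i and lat_j <= lat_i
            and (acc_j > acc_i or lat_j < lat_i)
            for j, (acc_j, lat_j) in enumerate(archs)
            if j != i
        )
        if not dominated:
            front.append((acc_i, lat_i))
    return front

# Hypothetical candidates: (accuracy, latency in ms).
candidates = [(0.772, 3.68), (0.755, 2.10), (0.760, 5.00), (0.740, 2.50)]
print(pareto_front(candidates))  # -> [(0.772, 3.68), (0.755, 2.10)]
```

The last two candidates are dominated (a faster or more accurate alternative exists for each), so the Pareto front keeps only the first two; a rank-preserving supernetwork should order subnetworks consistently with such dominance relations.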

