PARETO RANK-PRESERVING SUPERNETWORK FOR HW-NAS

Abstract

In neural architecture search (NAS), training every sampled architecture is very time-consuming and should be avoided. Weight sharing is a promising solution to speed up the evaluation process. However, a sampled sub-network is not guaranteed to be estimated precisely unless a complete individual training process is done. Additionally, practical deep learning engineering requires incorporating realistic hardware-performance metrics into the NAS evaluation process, also known as hardware-aware NAS (HW-NAS). HW-NAS results in a Pareto front, the set of all architectures that optimize conflicting objectives, i.e., task-specific performance and hardware efficiency. This paper proposes a supernetwork training methodology that preserves the Pareto ranking between its different sub-networks, resulting in more efficient and accurate neural networks for a variety of hardware platforms. The results show a 97% Pareto front approximation in less than 2 GPU days of search, a 2× speedup compared to state-of-the-art methods. We validate our methodology on NAS-Bench-201, DARTS, and ImageNet. Our optimal model achieves 77.2% accuracy (+1.7% compared to the baseline) with an inference time of 3.68 ms on an Edge GPU for ImageNet.

1. INTRODUCTION

A key element in solving real-world deep learning (DL) problems is the optimal selection of the sequence of operations and their hyperparameters, called the DL architecture. Neural architecture search (NAS) (Santra et al. (2021)) automates the design of DL architectures by searching for the best architecture within a set of possible architectures, called the search space. When hardware constraints are considered, hardware-aware neural architecture search (HW-NAS) (Benmeziane et al. (2021); Sekanina (2021)) simultaneously optimizes the task-specific performance, such as accuracy, and the hardware efficiency, computed from the latency, energy consumption, memory occupancy, and chip area.

Training every candidate architecture from scratch is prohibitively expensive, so estimation strategies are used instead. These strategies are evaluated on how well they respect the ground-truth ranking between the architectures in the search space. Weight sharing is an estimation strategy that formulates the search space as a supernetwork, an over-parameterized architecture in which, in each layer, all possible operations are trained, and from which each path can be sampled to obtain a sub-network. With this definition, weight-sharing NAS falls into two categories: (1) two-stage NAS, which first trains the supernetwork on the targeted task and then, using the pre-trained supernetwork, estimates each sampled sub-network's performance with a search strategy such as an evolutionary algorithm; and (2) one-stage NAS, which searches and trains the supernetwork simultaneously, assigning additional parameters to each possible operation per layer and training these parameters to select which operation is appropriate for each layer. Both weight-sharing categories assume that the ranking between different sub-networks is preserved, i.e., if one sub-network outperforms another when trained individually, the supernetwork's estimates should rank them in the same order. State-of-the-art works (Zhang et al. (2020; 2021)) have highlighted the training inefficiency of this approach by computing the ranking correlation between the architectures' actual rankings and their estimated rankings. Some solutions train the supernetwork under strict fairness constraints to preserve the accuracy ranking, such as FairNAS (Chu et al. (2021)); others train a graph convolutional network in parallel to fit the performance of sampled sub-networks (Chen et al. (2021)). However, current solutions have two main drawbacks:

1. In the multi-objective context of HW-NAS, different objectives, such as accuracy and latency, have to be estimated. The result is a Pareto front, the set of architectures that best respects the trade-off between the conflicting objectives. The ranking under a single objective is therefore no longer a good metric for the estimator; the ranking must take the dominance concept into account. Inaccurate estimates of either objective hinder the final Pareto front approximation and harm the search exploration when accuracy and latency are both considered.

2. Many works (Chen et al. (2021); Zhao et al. (2021); Guo et al. (2020)) attempt to fix the supernetwork sampling after its training. We believe this strategy is inefficient because it relies on the pre-trained supernetwork, whose accuracy-based ranking correlation is poor: in Dong & Yang (2020), a Kendall's tau-b rank correlation coefficient of only 0.47 was obtained on NAS-Bench-201 when using this approach. The accuracy estimation is thus inconclusive and will mislead any NAS search strategy.

To overcome the aforementioned issues, we propose a new training methodology for supernetworks that preserves the Pareto ranking of sub-networks in HW-NAS and avoids additional ranking-correction steps. The contributions of this paper are summarized as follows:

• We define the Pareto ranking as a novel metric to compare HW-NAS evaluators in the multi-objective context. Our study shows that optimizing this metric while training the supernetwork increases the Kendall rank correlation coefficient from 0.47 to 0.97 for a vanilla weight-sharing NAS.

• We introduce a novel one-stage weight-sharing supernetwork training methodology. The training jointly optimizes the task-specific loss function (e.g., cross-entropy) and a Pareto-ranking listwise loss function to accurately select the adequate operation per layer.

• During training, we prune the operations that are least likely to appear in architectures of the optimal Pareto front. The pruning overlaps the worst Pareto-ranked sub-networks and removes the operations that are used only in these sub-networks.

• We demonstrate that, using our methodology on three different search spaces, namely NAS-Bench-201 (Dong & Yang (2020)), DARTS (Liu et al. (2019)) and the ProxylessNAS search space (Cai et al. (2019)), we achieve a higher Pareto front approximation than current state-of-the-art methods. For example, we obtain a 97% Pareto front approximation where One-Shot-NAS-GCN (Chen et al. (2021)) reaches only 87% on NAS-Bench-201.
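To make the weight-sharing formulation above concrete, the following minimal sketch represents a supernetwork as a table of weights shared across sub-networks, where sampling a sub-network amounts to picking one candidate operation per layer. The operation names, layer count, and string weight placeholders are illustrative assumptions, not the paper's actual search space:

```python
import random

# Hypothetical toy search space: 4 layers, 3 candidate operations per layer.
CANDIDATE_OPS = ["conv3x3", "conv5x5", "skip"]
NUM_LAYERS = 4

# Weight sharing: one weight entry per (layer, operation), reused by every
# sub-network whose path goes through that (layer, operation) pair.
shared_weights = {(l, op): f"W[{l},{op}]"
                  for l in range(NUM_LAYERS) for op in CANDIDATE_OPS}

def sample_subnetwork(rng):
    """Sampling a sub-network = choosing one path through the supernetwork."""
    return tuple(rng.choice(CANDIDATE_OPS) for _ in range(NUM_LAYERS))

def subnet_weights(subnet):
    """Every sampled sub-network reuses the shared per-(layer, op) weights."""
    return [shared_weights[(l, op)] for l, op in enumerate(subnet)]

rng = random.Random(0)
print(len(CANDIDATE_OPS) ** NUM_LAYERS)   # distinct sub-networks -> 81
print(subnet_weights(sample_subnetwork(rng)))
```

In this sketch, two-stage NAS would train `shared_weights` once and then only call `sample_subnetwork` during the search, while one-stage NAS would additionally learn a selection parameter per (layer, operation) while training.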

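The dominance-based ranking discussed above can be sketched as a standard non-dominated sorting pass: rank 0 is the Pareto front, and each successive rank is the front left after removing the previous ones. A minimal pure-Python version, with made-up (error rate, latency) pairs as the two minimized objectives:

```python
def dominates(a, b):
    """a dominates b if a is no worse in every objective and strictly
    better in at least one; all objectives are minimized."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

def pareto_ranks(points):
    """Non-dominated sorting: returns one Pareto rank per point."""
    remaining = set(range(len(points)))
    ranks = [None] * len(points)
    rank = 0
    while remaining:
        front = {i for i in remaining
                 if not any(dominates(points[j], points[i])
                            for j in remaining if j != i)}
        for i in front:
            ranks[i] = rank
        remaining -= front
        rank += 1
    return ranks

# Hypothetical (error rate, latency in ms) scores for four architectures.
archs = [(0.10, 5.0), (0.12, 3.0), (0.11, 6.0), (0.20, 9.0)]
print(pareto_ranks(archs))  # -> [0, 0, 1, 2]
```

A rank-correlation coefficient such as Kendall's tau-b can then be computed between the Pareto ranks estimated through the supernetwork and those obtained from ground-truth measurements; this is the quantity reported above as rising from 0.47 to 0.97.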
2. BACKGROUND & RELATED WORK

This section summarizes the state of the art in accelerating multi-objective HW-NAS.



ACCELERATING HARDWARE-AWARE NAS

Given a target hardware platform and a DL task, hardware-aware neural architecture search (HW-NAS) (Benmeziane et al. (2021)) automates the design of efficient DL architectures. HW-NAS is a multi-objective optimization problem in which different and contradictory objectives, such as accuracy, latency, energy consumption, memory occupancy, and chip area, have to be optimized. HW-NAS has three main components: (1) the search space, (2) the evaluation method, and (3) the search strategy.

HW-NAS works (Cai et al. (2019); Lin et al. (2021); Wang et al. (2022)) have demonstrated their usefulness and discovered state-of-the-art architectures for image classification (Lin et al. (2021)), object detection (Chen et al. (2019)), and keyword spotting (Busia et al. (2022)). Techniques for HW-NAS span evolutionary search, Bayesian optimization, reinforcement learning, and gradient-based methods, all of which require evaluating each sampled architecture on the targeted task and hardware platform. However, this evaluation is extremely time-consuming, especially for the task-specific performance, which requires training the architecture. Many estimation strategies (White et al. (2021)) are used to alleviate this problem, such as neural predictor methods (Benmeziane et al. (2022a); Ning et al. (2020)), zero-cost learning (Lopes et al. (2021); Abdelfattah et al. (2021)), and weight sharing (Chu et al. (2021); Chen et al. (2021); Peng et al. (2021); Zhao et al. (2021)).
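As a sketch of the evaluation bottleneck described in this section, the loop below samples candidate architectures, scores each on two objectives, and maintains the non-dominated archive that approximates the Pareto front. The integer architecture encoding and the `mock_evaluate` scores are placeholders for the costly accuracy and latency measurements a real HW-NAS run would perform:

```python
import random

def dominates(a, b):
    # a dominates b when a is no worse everywhere and strictly better
    # somewhere (both objectives are minimized).
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

def mock_evaluate(arch):
    """Placeholder for the two expensive measurements HW-NAS needs per
    candidate: task error (would require training) and hardware latency."""
    rng = random.Random(arch)
    return (round(rng.uniform(0.05, 0.30), 3), round(rng.uniform(2.0, 10.0), 2))

def search(num_samples=50, seed=0):
    rng = random.Random(seed)
    archive = {}  # architecture -> (error, latency); kept non-dominated
    for _ in range(num_samples):
        arch = rng.randrange(10_000)              # toy architecture encoding
        score = mock_evaluate(arch)
        if any(dominates(s, score) for s in archive.values()):
            continue                              # dominated by the archive
        archive = {a: s for a, s in archive.items()
                   if not dominates(score, s)}    # drop newly dominated points
        archive[arch] = score
    return archive

front = search()
print(len(front), "non-dominated architectures found")
```

A real HW-NAS would replace the random sampling with one of the search techniques listed above (evolutionary, Bayesian, reinforcement-learning, or gradient-based) and replace `mock_evaluate` with an estimation strategy plus hardware measurements, which is exactly where the supernetwork-based estimators come in.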

