ONCE QUANTIZED FOR ALL: PROGRESSIVELY SEARCHING FOR QUANTIZED COMPACT MODELS

Abstract

Automatic search of Quantized Neural Networks (QNNs) has attracted a lot of attention. However, existing quantization-aware Neural Architecture Search (NAS) approaches inherit a two-stage search-retrain schema, which is not only time-consuming but also adversely affected by the unreliable ranking of architectures during the search. To avoid the undesirable effects of the search-retrain schema, we present Once Quantized for All (OQA), a novel framework that searches for quantized compact models and deploys their quantized weights at the same time, without additional post-processing. While supporting a huge architecture search space, OQA can produce a series of quantized compact models under ultra-low bit-widths (e.g., 4/3/2 bits). A progressive bit inheritance procedure is introduced to support ultra-low bit-widths. Our searched model family, OQANets, achieves a new state-of-the-art (SOTA) among quantized compact models compared with various quantization methods and bit-widths. In particular, OQA2bit-L achieves 64.0% ImageNet Top-1 accuracy, outperforming its 2-bit counterpart EfficientNet-B0@QKD by a large margin of 14% while using 30% less computation cost.

1. INTRODUCTION

Compact architecture design (Sandler et al., 2018; Ma et al., 2018) and network quantization methods (Choi et al., 2018; Kim et al., 2019; Esser et al., 2019) are two promising research directions for deploying deep neural networks on mobile devices. Network quantization aims at reducing the number of bits used to represent network parameters and features. On the other hand, Neural Architecture Search (NAS) (Howard et al., 2019; Cai et al., 2019; Yu et al., 2020) automatically searches for compact architectures, avoiding expert effort and design trials. In this work, we explore the ability of NAS to find quantized compact models, thereby enjoying the merits of both directions.

Traditional combinations of NAS and quantization can be classified as either NAS-then-Quantize or Quantization-aware NAS, as shown in Figure 1. Conventional quantization methods merely compress off-the-shelf networks, regardless of whether they are searched (EfficientNet (Tan & Le, 2019)) or handcrafted (MobileNetV2 (Sandler et al., 2018)). These methods correspond to the NAS-then-Quantize approach shown in Figure 1(a). However, this routine is not optimal because the accuracy ranking among the searched floating-point models can change after they are quantized, so it may fail to produce a good quantized model. Directly searching with quantized models' performance seems to be a solution. Existing quantization-aware NAS methods (Wang et al., 2019; Shen et al., 2019; Bulat et al., 2020; Guo et al., 2019; Wang et al., 2020) utilize a two-stage search-retrain schema, as shown in Figure 1(b). Specifically, they first search for one architecture under one bit-width setting¹, and then retrain the model under the given bit-width. This two-stage procedure undesirably increases the search and retrain cost when there are multiple deployment constraints and hardware bit-widths.
Furthermore, due to the instability brought by quantization-aware training, simply combining quantization and NAS results in unreliable ranking (Li et al., 2019a; Guo et al., 2019) and sub-optimal



¹ One bit-width setting refers to a specific bit-width for each layer, where different layers may have different bit-widths.

