ONCE QUANTIZED FOR ALL: PROGRESSIVELY SEARCHING FOR QUANTIZED COMPACT MODELS

Abstract

Automatic search of Quantized Neural Networks (QNN) has attracted a lot of attention. However, existing quantization-aware Neural Architecture Search (NAS) approaches inherit a two-stage search-retrain schema, which is not only time-consuming but also adversely affected by the unreliable ranking of architectures during the search. To avoid the undesirable effects of the search-retrain schema, we present Once Quantized for All (OQA), a novel framework that searches for quantized compact models and deploys their quantized weights at the same time without additional post-processing. While supporting a huge architecture search space, our OQA can produce a series of quantized compact models under ultra-low bit-widths (e.g., 4/3/2 bit). A progressive bit inheritance procedure is introduced to support these ultra-low bit-widths. Our searched model family, OQANets, achieves a new state-of-the-art (SOTA) on quantized compact models compared with various quantization methods and bit-widths. In particular, OQA2bit-L achieves 64.0% ImageNet Top-1 accuracy, outperforming its 2-bit counterpart EfficientNet-B0@QKD by a large margin of 14% while using 30% less computation cost.

1. INTRODUCTION

Compact architecture design (Sandler et al., 2018; Ma et al., 2018) and network quantization methods (Choi et al., 2018; Kim et al., 2019; Esser et al., 2019) are two promising research directions for deploying deep neural networks on mobile devices. Network quantization aims at reducing the number of bits for representing network parameters and features. On the other hand, Neural Architecture Search (NAS) (Howard et al., 2019; Cai et al., 2019; Yu et al., 2020) is proposed to automatically search for compact architectures, avoiding expert effort and design trials. In this work, we explore the ability of NAS to find quantized compact models and thus enjoy the merits of both sides. Traditional combinations of NAS and quantization methods can be classified as either NAS-then-Quantize or Quantization-aware NAS, as shown in Figure 1. Conventional quantization methods merely compress off-the-shelf networks, regardless of whether they are searched (EfficientNet (Tan & Le, 2019)) or handcrafted (MobileNetV2 (Sandler et al., 2018)). These methods correspond to the NAS-then-Quantize approach shown in Figure 1(a). However, this is not optimal because the accuracy ranking among the searched floating-point models can change after they are quantized. Thus, this traditional routine may fail to find a good quantized model. Directly searching with quantized models' performance seems to be a solution. Existing quantization-aware NAS methods (Wang et al., 2019; Shen et al., 2019; Bulat et al., 2020; Guo et al., 2019; Wang et al., 2020) utilize a two-stage search-retrain schema as shown in Figure 1(b). Specifically, they first search for one architecture under one bit-width setting¹, and then retrain the model under the given bit-width. This two-stage procedure undesirably increases the search and retrain cost when there are multiple deployment constraints and hardware bit-widths.
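To make the notion of "reducing the number of bits" concrete, the sketch below shows a generic symmetric uniform weight quantizer. This is an illustration of quantization in general, not the exact quantizer used by OQA; the per-tensor maximum-based step size is an assumption for simplicity (learned-step-size methods such as Esser et al., 2019 learn the step instead).

```python
import numpy as np

def quantize_uniform(w: np.ndarray, bits: int) -> np.ndarray:
    """Fake-quantize a weight tensor to `bits` bits (symmetric, per-tensor).

    Illustrative only: the step size is derived from the tensor's max
    magnitude, whereas learned-step-size quantizers train it directly.
    """
    qmax = 2 ** (bits - 1) - 1          # e.g. 7 for signed 4-bit
    scale = np.abs(w).max() / qmax      # per-tensor quantization step
    q = np.clip(np.round(w / scale), -qmax - 1, qmax)
    return q * scale                    # de-quantized ("fake-quantized") values
```

At 2 bits the weights collapse onto at most four distinct levels, which is why training becomes unstable at ultra-low bit-widths.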
Furthermore, due to the instability brought by quantization-aware training, simply combining quantization and NAS results in unreliable ranking (Li et al., 2019a; Guo et al., 2019) and sub-optimal quantized models (Bulat et al., 2020). Moreover, when the quantization bit-width is lower than 3, the traditional training process is highly unstable and introduces very large accuracy degradation. To alleviate the aforementioned problems, we present Once Quantized for All (OQA), a novel framework that: 1) searches for quantized network architectures and deploys their quantized weights immediately without retraining, and 2) progressively produces a series of quantized models under ultra-low bit-widths (e.g., 4/3/2 bit). Our approach leverages recent NAS approaches that do not require retraining (Yu & Huang, 2019; Cai et al., 2019; Yu et al., 2020). We search over kernel size, depth, width, and resolution in our search space. To provide a better initialization and transfer the knowledge of a higher bit-width QNN to a lower bit-width QNN, we propose a bit inheritance mechanism, which reduces the bit-width progressively to enable searching for QNNs under different quantization bit-widths. Benefiting from the retraining-free property and the large search space under different bit-widths, we can evaluate the effect of network factors. Extensive experiments show the effectiveness of our approach. Our searched quantized model family, OQANets, achieves state-of-the-art (SOTA) results on the ImageNet dataset under 4/3/2 bit-widths. In particular, our OQA2bit-L exceeds the accuracy of 2-bit EfficientNet-B0@QKD (Kim et al., 2019) by a large margin of 14% while using 30% less computation budget. Compared with the quantization-aware NAS method APQ (Wang et al., 2020), our OQA4bit-L-MBV2 uses 43.7% less computation cost while maintaining the same accuracy as APQ-B.
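The progressive part of bit inheritance can be sketched as follows: the model at each lower bit-width starts from the weights of its higher bit-width parent rather than from scratch. The sketch below is a minimal illustration of that schedule; the `quantize_uniform` helper is a generic stand-in quantizer, and in the actual method each stage would be followed by quantization-aware supernet training rather than the no-op noted in the comment.

```python
import numpy as np

def quantize_uniform(w: np.ndarray, bits: int) -> np.ndarray:
    """Generic symmetric uniform fake-quantizer (illustrative stand-in)."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(w).max() / qmax
    return np.clip(np.round(w / scale), -qmax - 1, qmax) * scale

def bit_inheritance_schedule(weights: np.ndarray,
                             start_bits: int = 4,
                             end_bits: int = 2) -> dict:
    """Progressively lower the bit-width, letting each lower-bit model
    inherit the weights of its higher-bit parent as initialization."""
    w = weights
    stages = {}
    for bits in range(start_bits, end_bits - 1, -1):
        w = quantize_uniform(w, bits)   # inherit, then re-quantize at fewer bits
        stages[bits] = w                # in practice: fine-tune the supernet here
    return stages
```

The design intuition is that a 3-bit model initialized from well-trained 4-bit weights starts much closer to a good solution than one quantized directly from floating point, which is what makes the 2-bit regime trainable at all.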
To summarize, the contributions of our paper are three-fold:
• Our OQA is the first quantization-aware NAS framework to search for the architectures of quantized compact models and deploy their quantized weights without retraining.
• We present a bit inheritance mechanism that reduces the bit-width progressively, so that higher bit-width models can guide the search and training of lower bit-width models.
• We provide insights into quantization-friendly architecture design. Our systematic analysis reveals that shallow-fat models are more likely to be quantization-friendly than deep-slim models under low bit-widths.



¹ One bit-width setting refers to a specific bit-width for each layer, where different layers can have different bit-widths.



Figure 1: The overall frameworks of existing works combining quantization and NAS, and of our method. (a) directly converts the best searched floating-point architecture to quantization. (b) first adopts a quantization-aware search algorithm to find a single architecture, then retrains the quantized weights and activations. Our OQA (c) can search for many quantized compact models under various bit-widths and deploy their quantized weights directly.
