BASISNET: TWO-STAGE MODEL SYNTHESIS FOR EFFICIENT INFERENCE

Abstract

We present BasisNet, which combines recent advancements in efficient neural network architectures, conditional computation, and early termination in a simple new form. Our approach uses a lightweight model to preview an image and generate input-dependent combination coefficients, which are later used to control the synthesis of a specialist model for making a more accurate final prediction. The two-stage model synthesis strategy can be used with any network architecture, and both stages can be jointly trained end to end. We validated BasisNet on ImageNet classification with MobileNets as the backbone, and demonstrated a clear advantage in the accuracy-efficiency trade-off over strong baselines such as EfficientNet (Tan & Le, 2019), FBNetV3 (Dai et al., 2020), and OFA (Cai et al., 2019). Specifically, BasisNet-MobileNetV3 obtained 80.3% top-1 accuracy with only 290M Multiply-Add operations (MAdds), halving the computational cost of the previous state-of-the-art without sacrificing accuracy. In addition, since the first-stage lightweight model can independently make predictions, inference can be terminated early if the prediction is sufficiently confident. With early termination, the average cost can be further reduced to 198M MAdds while maintaining an accuracy of 80.0%.

1. INTRODUCTION

High-accuracy yet low-latency convolutional neural networks enable opportunities for on-device machine learning, and are playing increasingly important roles in various mobile applications, including but not limited to intelligent personal assistants, AR/VR, and real-time voice translation. Designing efficient convolutional neural networks, especially for edge devices, has received significant research attention. Prior research has tackled this challenge from different perspectives, such as novel network architectures (Howard et al., 2017; Sandler et al., 2018; Ma et al., 2018; Zhang et al., 2018; Howard et al., 2019), better integration with hardware accelerators (Lee et al., 2019), or conditional computation and adaptive inference algorithms (Bolukbasi et al., 2017; Figurnov et al., 2017; Leroux et al., 2017; Wang et al., 2018; Marquez et al., 2018). However, focusing on one perspective in isolation may have side effects. For example, novel network architectures may introduce custom operators that are not well-supported by hardware accelerators, so a promising new model may yield limited practical improvement on real devices due to a lack of hardware support. We believe that these three perspectives should be better integrated to form a more holistic approach that ensures the broader applicability of the resulting system.

In this paper, we present BasisNet, which takes advantage of progress in all these perspectives and combines several key ideas in a simple new form. The core idea behind BasisNet is dynamic model synthesis, which efficiently generates a sample-dependent specialist model from a collection of bases, so the resulting model is specialized for handling the given input and can give more accurate predictions. This concept is flexible and can be applied to any novel network architecture.
On the hardware side, the two-stage model synthesis strategy allows the lightweight and synthesized specialist models to execute on different processing units (e.g., CPU, mobile GPU, dedicated accelerator) in parallel to better handle streaming data. The BasisNet design is naturally compatible with early termination, and can easily balance computation budget against accuracy with a single hyperparameter (prediction confidence). An overview of BasisNet is shown in Fig. 1.

Figure 1: An overview of BasisNet; more details can be found in Sec. 3.2. For easy images (e.g., distinguishing cats from dogs), the lightweight model can give sufficiently accurate predictions, so the second stage can be bypassed. For more difficult images (e.g., distinguishing different breeds of dogs), a specialist model is synthesized under guidance from the lightweight model; the specialist is good at recognizing subtle differences and makes more accurate predictions about the given images.

Using image classification as an example, BasisNet has two stages: the first stage relies on a lightweight model to preview the input image and produce both an initial prediction and a group of combination coefficients. In the second stage, the coefficients are used to combine a set of models, which we call basis models, into a single one to process the image and generate the final classification result. The second stage can be skipped if the initial prediction is sufficiently confident. The basis models share the same architecture but differ in some weight parameters, while other weights are shared to avoid overfitting and reduce the total model size. We validated BasisNet with different generations and sizes of MobileNets and observed significant improvements in inference efficiency. In Table 1 we show comparisons 1 with recent efficient networks on the ImageNet classification benchmark.
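The second stage can be illustrated with a minimal sketch of weight synthesis: a specialist kernel is formed as a coefficient-weighted sum of the basis kernels. The function name, tensor shapes, and normalization below are illustrative assumptions, not the paper's exact implementation (which is detailed in Sec. 3.2).

```python
import numpy as np

def synthesize_specialist(basis_kernels, coefficients):
    """Combine basis kernels into one specialist kernel (illustrative sketch).

    basis_kernels: array of shape (num_bases, kh, kw, cin, cout), one
                   convolution kernel per basis model.
    coefficients:  array of shape (num_bases,), the input-dependent
                   combination coefficients from the lightweight model.
    """
    coefficients = np.asarray(coefficients, dtype=np.float64)
    coefficients = coefficients / coefficients.sum()  # assume a convex combination
    # Contract the basis axis: the result is a single kernel of shape
    # (kh, kw, cin, cout), specialized for the current input.
    return np.tensordot(coefficients, basis_kernels, axes=1)
```

Because only the weights are combined (the architecture is fixed), the synthesized specialist runs at the same per-image cost as a single basis model.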
Notably, without early termination, our BasisNet with 16 basis models of MobileNetV3-Large requires only 290M Multiply-Adds (MAdds) to achieve 80.3% top-1 accuracy, halving the computation cost of the previous state-of-the-art (Cai et al., 2019) without sacrificing accuracy. If we enable early termination, the average cost can be further reduced to 198M MAdds with top-1 accuracy remaining at 80.0% on ImageNet.
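The early-termination rule described above can be sketched as a simple confidence check on the lightweight model's output. The threshold value and function names below are illustrative assumptions; the paper only specifies that a single confidence hyperparameter controls the trade-off.

```python
import numpy as np

def predict_with_early_exit(logits_light, run_specialist, threshold=0.9):
    """Return the lightweight prediction if confident enough; otherwise
    fall back to the synthesized specialist (illustrative sketch).

    logits_light:   1-D array of class logits from the lightweight model.
    run_specialist: callable that synthesizes and runs the specialist,
                    returning a class index.
    threshold:      confidence hyperparameter trading accuracy for cost.
    """
    # Softmax with the usual max-subtraction for numerical stability.
    probs = np.exp(logits_light - logits_light.max())
    probs /= probs.sum()
    if probs.max() >= threshold:
        return int(probs.argmax()), "lightweight"  # early exit: stage two skipped
    return run_specialist(), "specialist"          # hard input: run stage two
```

Raising the threshold routes more inputs to the specialist (higher accuracy, higher average cost); lowering it lets more inputs exit early, which is how the average cost drops from 290M to 198M MAdds while the per-image maximum stays at 290M.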



1 Listed models may use different training recipes (e.g., knowledge distillation♦, extra data♥, custom data augmentation♠, and AutoML-based training hyperparameter search♣).
2 Average cost is reduced because easy inputs are handled by the lightweight model alone; the maximum cost remains 290M MAdds.



Table 1: Comparison with selected efficient networks on ImageNet. Statistics for the referenced baselines are cited from the original papers. See Appendix C for a detailed comparison, including training recipes.

