BASISNET: TWO-STAGE MODEL SYNTHESIS FOR EFFICIENT INFERENCE

Abstract

We present BasisNet, which combines recent advancements in efficient neural network architectures, conditional computation, and early termination in a simple new form. Our approach uses a lightweight model to preview an image and generate input-dependent combination coefficients, which are later used to control the synthesis of a specialist model that makes a more accurate final prediction. The two-stage model synthesis strategy can be used with any network architecture, and both stages can be jointly trained end to end. We validated BasisNet on ImageNet classification with MobileNets as the backbone and demonstrated a clear advantage in the accuracy-efficiency trade-off over strong baselines such as EfficientNet (Tan & Le, 2019), FBNetV3 (Dai et al., 2020) and OFA (Cai et al., 2019). Specifically, BasisNet-MobileNetV3 obtained 80.3% top-1 accuracy with only 290M Multiply-Add operations (MAdds), halving the computational cost of the previous state of the art without sacrificing accuracy. Moreover, since the first-stage lightweight model can make predictions on its own, inference can be terminated early if the prediction is sufficiently confident. With early termination, the average cost can be further reduced to 198M MAdds while maintaining an accuracy of 80.0%.

1. INTRODUCTION

High-accuracy yet low-latency convolutional neural networks enable on-device machine learning and are playing increasingly important roles in a variety of mobile applications, including but not limited to intelligent personal assistants, AR/VR, and real-time voice translation. Designing efficient convolutional neural networks, especially for edge devices, has therefore received significant research attention. Prior research has tackled this challenge from different perspectives, such as novel network architectures (Howard et al., 2017; Sandler et al., 2018; Ma et al., 2018; Zhang et al., 2018; Howard et al., 2019), better integration with hardware accelerators (Lee et al., 2019), and conditional computation and adaptive inference algorithms (Bolukbasi et al., 2017; Figurnov et al., 2017; Leroux et al., 2017; Wang et al., 2018; Marquez et al., 2018). However, focusing on one perspective in isolation may have side effects. For example, novel network architectures may introduce custom operators that are not well supported by hardware accelerators, so a promising new model may yield only limited practical improvement on real devices due to a lack of hardware support. We believe these three perspectives should be better integrated into a more holistic approach that ensures the broader applicability of the resulting system.

In this paper, we present BasisNet, which takes advantage of progress in all these perspectives and combines several key ideas in a simple new form. The core idea behind BasisNet is dynamic model synthesis: efficiently generating a sample-dependent specialist model from a collection of bases, so that the resulting model is specialized for the given input and can make more accurate predictions. This concept is flexible and can be applied to any network architecture.
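The synthesis step can be sketched as a weighted combination of shared basis kernels, where the mixing weights come from the first-stage model. The following NumPy snippet is a minimal illustration of this idea, not the paper's implementation; the tensor shapes, the single-layer scope, and the softmax normalization of the coefficients are all assumptions made for the example.

```python
import numpy as np

def synthesize_kernel(bases, coeffs):
    """Combine N basis kernels into one sample-specific conv kernel.

    bases:  shape (N, kh, kw, cin, cout) -- shared basis kernels
    coeffs: shape (N,) -- input-dependent mixing weights produced by
            the lightweight first-stage model (assumed normalized here)
    """
    # Contract the basis axis: sum_i coeffs[i] * bases[i]
    return np.tensordot(coeffs, bases, axes=1)  # -> (kh, kw, cin, cout)

# Toy example: 4 bases for a 3x3 conv with 8 input / 16 output channels.
rng = np.random.default_rng(0)
bases = rng.standard_normal((4, 3, 3, 8, 16))
logits = rng.standard_normal(4)
coeffs = np.exp(logits) / np.exp(logits).sum()  # softmax over bases
kernel = synthesize_kernel(bases, coeffs)
print(kernel.shape)  # (3, 3, 8, 16)
```

Because the combination is linear in the bases, the specialist kernel can be materialized once per input and then used like any ordinary convolution weight, which keeps the synthesized model compatible with standard hardware-accelerated operators.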
On the hardware side, the two-stage model synthesis strategy allows the lightweight model and the synthesized specialist model to run on different processing units (e.g., CPU, mobile GPU, or dedicated accelerator) in parallel, to better handle streaming data. The BasisNet design is naturally compatible with early termination and can easily balance the computation budget against accuracy with a single hyperparameter (the prediction confidence). An overview of BasisNet is shown in Fig. 1. Using image classification as an example, our BasisNet has two stages: the first stage relies on a lightweight model to preview the input image
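The early-termination control flow described above can be summarized in a few lines. This is a schematic sketch under stated assumptions: `lightweight_model` and `synthesize_specialist` are hypothetical placeholders for the two stages, the lightweight model is assumed to emit both class logits and basis coefficients, and the confidence threshold is the single hyperparameter mentioned in the text.

```python
import numpy as np

CONF_THRESHOLD = 0.9  # single knob trading accuracy for average compute

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def predict(image, lightweight_model, synthesize_specialist):
    """Two-stage inference with early termination (schematic).

    lightweight_model(image) -> (class_logits, basis_coeffs)
    synthesize_specialist(basis_coeffs) -> callable specialist model
    Both callables stand in for the actual networks.
    """
    logits, coeffs = lightweight_model(image)
    probs = softmax(logits)
    if probs.max() >= CONF_THRESHOLD:
        return int(probs.argmax())          # confident: stop after stage one
    specialist = synthesize_specialist(coeffs)
    return int(specialist(image).argmax())  # otherwise run the specialist
```

Raising `CONF_THRESHOLD` routes more inputs through the specialist (higher accuracy, higher average MAdds); lowering it lets more inputs exit after the cheap first stage.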

