CONDITIONAL EXECUTION OF CASCADED MODELS IMPROVES THE ACCURACY-EFFICIENCY TRADE-OFF

Anonymous

Abstract

The compute effort required to perform inference on state-of-the-art deep learning models is ever growing. Practical applications are commonly limited to a certain cost per inference. Cascades of pretrained models with conditional execution address these requirements based on the intuition that some inputs are easy enough that they can be processed correctly by a small model, allowing for an early exit. If the small model is not sufficiently confident in its prediction, the input is passed on to a larger model. The selection of the confidence threshold makes it possible to trade off compute effort against accuracy. In this work, we explore the effective design of model cascades and thoroughly evaluate the impact on the accuracy-compute trade-off. We find that they not only interpolate favorably between pretrained models, but that this trade-off curve commonly outperforms single models. This allows us to redefine most of the ImageNet Pareto front with 2-model cascades alone, achieving an average reduction in compute effort at equal accuracy of almost 3.1× above 86% and more than 1.9× between 80% and 86% top-1 accuracy. We confirm the wide applicability and effectiveness of the method on the GLUE benchmark. We release the code to reproduce our experiments in the supplementary material and use only publicly available models and datasets.

1. INTRODUCTION

The trade-off between accuracy and efficiency is fundamentally important to deep learning. While state-of-the-art results are achieved by models of ever growing size, practical applications are constrained. Reducing energy consumption is important for energy-limited systems like wearable and mobile devices. For inference at large scale, such as in data centers, minimizing compute resource requirements is important both economically and environmentally. There are many different approaches to improving the accuracy-efficiency trade-off (see Section 2). We explore early exiting from cascades of pretrained models for classification tasks. Such an approach can work when some inputs are easier to classify than others and can therefore be processed by a smaller model to save computation. Assuming the input's difficulty is unknown, a first classification can be made by the smaller model; if it is not confident in its prediction, the input can be classified again by a larger and more accurate model. The models in an early exit cascade are ordered by cost from cheapest to most expensive, and an early exit is made after the first model whose prediction confidence satisfies the exit condition. Early exiting is often explored from within a model, as in BranchyNet (Teerapittayanon et al., 2016) or Shallow-Deep Networks (Kaya et al., 2019), by inserting additional classifier outputs earlier in the model. Huang et al. (2017) describe how early classifiers lack coarse features and how training the model for early classification can lower the accuracy of later classifiers. Furthermore, modern convolutional neural networks are highly optimized and their depth, width and resolution are carefully scaled, gradually lowering the resolution and increasing the number of channels as features become more complex (Tan & Le, 2019).
For early classification outputs, this optimization and scaling is disrupted, which results in a lower accuracy than a separate model can achieve at equal compute effort. We demonstrate experimentally that a smaller accuracy difference between classifiers allows for more frequent early exiting. This motivates using a separate model for early classification, which enables us to increase the early exit rate at the cost of a small overhead when forwarding to the larger model, since features are no longer shared between classifiers. However, early exiting from within a single model and between cascaded models are two separate and complementary approaches that can be combined (Bolukbasi et al., 2017). A disadvantage of early exit model cascades is that more than one model is required, which increases memory as well as worst-case latency and compute effort, where the worst case is no early exit. However, the costs are dominated by the largest model, which means that the added cost of smaller models is usually negligible. Key advantages are the ease of use and efficacy. Pretrained models can be used as is or finetuned. Cascading is a complementary approach and can be combined with many other methods for increasing efficiency, particularly those that improve the individual models, such as better architectures and training. When the input is processed by at least 2 models in the cascade, the predictions can be ensembled. Gontijo-Lopes et al. (2021) show that ensembles are more accurate when the models are more diverse due to the lower error correlation, which can also be leveraged by cascades. Although early exit model cascades are simple and effective, little prior work on them has been published (see Section 2). We explore how to build such cascades effectively on ImageNet (Russakovsky et al., 2015).
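The two-model early exit scheme described above can be written down compactly. The following is a minimal sketch, not the authors' exact implementation; `small_model` and `large_model` are hypothetical placeholders for any pair of pretrained classifiers that return a list of logits, and the optional averaging illustrates the ensembling mentioned above.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def cascade_predict(x, small_model, large_model, threshold, ensemble=False):
    """Two-model early exit cascade.

    The cheap model runs first; if its maximum softmax confidence reaches
    `threshold`, we exit early. Otherwise the input is forwarded to the
    larger model, optionally ensembling both predictions by averaging.
    """
    small_probs = softmax(small_model(x))
    confidence = max(small_probs)
    if confidence >= threshold:
        return small_probs.index(confidence), "early_exit"
    large_probs = softmax(large_model(x))
    if ensemble:
        large_probs = [(a + b) / 2 for a, b in zip(small_probs, large_probs)]
    return large_probs.index(max(large_probs)), "full"
```

Sweeping `threshold` from 0 to 1 traces the accuracy-compute trade-off curve: at 0 every input exits early (cheapest), while at 1 effectively every input is also processed by the large model.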
For this we compare many different cascading methods across continuous Pareto fronts in multiple settings, using a diverse selection of pretrained models for a thorough and reliable evaluation. We then confirm their wide applicability on text classification tasks from the GLUE (Wang et al., 2018a) benchmark. Our contributions and findings are:

• We demonstrate that 2-model cascades already dominate the entire ImageNet Pareto front.

• We provide insight into how to construct model cascades: we empirically demonstrate a relationship between the accuracy difference of cascaded models and the achievable early exit rate, as well as the desirable size difference, in Figure 2.

• The maximum softmax confidence metric achieves the most improvement overall, while the softmax margin excels in low-confidence scenarios and the commonly used entropy metric performs worst.

• Ensembling predictions, when inference is done for at least 2 models in a cascade, generally increases the top accuracy achieved by the cascade but lowers the Pareto improvement. Temperature-scaled calibration is ineffective at alleviating this.

• For our evaluation we rely on an external baseline comprised of Pareto-optimal public models to ensure our results are credible and relevant. This provides a strong and reproducible baseline for future related research that is currently missing.

2. RELATED WORK

Early exit model cascades Cascades are commonly used in machine learning and have been popularized by influential works such as Viola & Jones (2001). We focus on a specific type of cascade, the early exit model cascade. BranchyNet (Teerapittayanon et al., 2016) adds branches to the original net for early evaluation and exits when the prediction entropy is below a threshold. Wołczyk et al. (2021) improve early exiting from within a model by recycling the predictions of earlier classifiers. In contrast, we focus on early exiting from a cascade of models, which is a complementary approach that is comparatively simple yet effective.
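The exit-confidence metrics compared in this work (maximum softmax probability, softmax margin, and prediction entropy) can be sketched as follows. This is an illustrative sketch, assuming each metric operates on the softmax output of a classifier; the entropy is negated so that, for all three metrics, higher values mean higher confidence and the exit condition is uniformly `metric(probs) >= threshold`.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def max_softmax(probs):
    """Maximum softmax probability: the confidence of the top class."""
    return max(probs)

def softmax_margin(probs):
    """Difference between the top-1 and top-2 class probabilities."""
    top2 = sorted(probs, reverse=True)[:2]
    return top2[0] - top2[1]

def negative_entropy(probs):
    """Negated Shannon entropy of the prediction; higher = more confident,
    matching the convention of the other two metrics."""
    return sum(p * math.log(p) for p in probs if p > 0)
```

Exiting when the negated entropy exceeds a threshold is equivalent to the common formulation of exiting when the entropy falls below one.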



Accuracy-efficiency trade-off Improving the accuracy-efficiency Pareto front is a central objective for a wide breadth of research with many different approaches. MobileNetV3 (Howard et al., 2019) and EfficientNet (Tan & Le, 2019) represent how architectures have become highly optimized. Model efficiency can be improved further by applying compression techniques like quantization and pruning (Han et al., 2015). Once-for-all (Cai et al., 2019) utilizes progressive shrinking of input resolution, kernel size, network width and depth, together with knowledge distillation, to obtain more efficient models than conventional neural architecture search. Training is very important, with recent advances most notably through data augmentation (Shorten & Khoshgoftaar, 2019) and pretraining on more data (Ridnik et al., 2021), which is enabled further by self-supervised (Devlin et al., 2018) and semi-supervised (Pham et al., 2021) methods. Many other procedures exist, such as model soups (Wortsman et al., 2022), which fine-tune a model with multiple hyperparameter configurations and average the weights. More closely related to our work are dynamic neural networks (Han et al., 2021; Xu & McAuley, 2022). An example of dynamic depth is SkipNet (Wang et al., 2018b), which adds gating modules and a learned skipping policy to skip network layers based on the input.

