CONDITIONAL EXECUTION OF CASCADED MODELS IMPROVES THE ACCURACY-EFFICIENCY TRADE-OFF

Anonymous

Abstract

The compute effort required to perform inference on state-of-the-art deep learning models is ever growing. Practical applications are commonly limited to a certain cost per inference. Cascades of pretrained models with conditional execution address these requirements, based on the intuition that some inputs are easy enough to be processed correctly by a small model, allowing for an early exit. If the small model is not sufficiently confident in its prediction, the input is passed on to a larger model. Selecting the confidence threshold allows compute effort to be traded off against accuracy. In this work, we explore the effective design of model cascades and thoroughly evaluate their impact on the accuracy-compute trade-off. We find that they not only interpolate favorably between pretrained models, but that their trade-off curve commonly outperforms single models. This allows us to redefine most of the ImageNet Pareto front with 2-model cascades alone, achieving an average reduction in compute effort at equal accuracy of almost 3.1× above 86% and more than 1.9× between 80% and 86% top-1 accuracy. We confirm the wide applicability and effectiveness of the method on the GLUE benchmark. We release the code to reproduce our experiments in the supplementary material and use only publicly available models and datasets.

1. INTRODUCTION

The trade-off between accuracy and efficiency is fundamentally important to deep learning. While state-of-the-art results are achieved by models of ever growing size, practical applications are constrained. Reducing energy consumption is important for energy-limited systems like wearable and mobile devices. For inference at large scale, such as in data centers, minimizing compute resource requirements is important both economically and environmentally. There are many different approaches to improving the accuracy-efficiency trade-off (see Section 2). We explore early exiting from cascades of pretrained models for classification tasks. Such an approach can work when some inputs are easier to classify than others and can therefore be processed by a smaller model to save computation. Assuming the input's difficulty is unknown, a first classification can be made by the smaller model, and if it is not confident in its prediction, the input can be classified again by a larger and more accurate model. The models in an early-exit cascade are ordered by cost from cheapest to most expensive, and an early exit is made after the first model whose prediction confidence satisfies the exit condition. Early exiting is often explored from within a model, as in BranchyNet (Teerapittayanon et al., 2016) or Shallow-Deep Networks (Kaya et al., 2019), by inserting additional classifier outputs earlier in the model. Huang et al. (2017) describe how early classifiers lack coarse features and how training the model for early classification can lower the accuracy of later classifiers. Furthermore, modern convolutional neural networks are highly optimized and their depth, width, and resolution are carefully scaled, gradually lowering the resolution and increasing the number of channels as features become more complex (Tan & Le, 2019).
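The early-exit cascade described above can be sketched in a few lines. The following is a minimal illustration, not the paper's implementation: models are queried from cheapest to most expensive, confidence is taken to be the maximum softmax probability (one common choice among several), and all function and variable names are hypothetical.

```python
import numpy as np

def cascade_predict(x, models, thresholds):
    """Early-exit cascade over pretrained classifiers.

    `models` is a list of callables (cheapest first) mapping an input to
    a vector of class probabilities; `thresholds` holds one confidence
    threshold per model except the last, which always answers.
    Illustrative sketch; names and the max-probability confidence
    measure are assumptions, not taken from the paper.
    """
    for model, tau in zip(models[:-1], thresholds):
        probs = model(x)
        # Exit early if the prediction confidence satisfies the
        # exit condition for this stage.
        if probs.max() >= tau:
            return int(np.argmax(probs))
    # No early exit: fall through to the largest, most accurate model.
    return int(np.argmax(models[-1](x)))
```

Raising the thresholds shifts more inputs to the large model (higher accuracy, higher cost); lowering them increases the early-exit rate, which is exactly the knob that traces out the accuracy-compute trade-off curve.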
For early classification outputs, this optimization and scaling is disrupted, resulting in lower accuracy than a separate model can achieve at equal compute effort. We demonstrate experimentally that a smaller accuracy difference between classifiers allows for more frequent early exiting. This motivates using a separate model for early classification, which enables us to increase the early-exit rate at the cost of a small overhead when forwarding to the larger model, since features are no longer shared between classifiers. However, early exit-

