COMPOFA: COMPOUND ONCE-FOR-ALL NETWORKS FOR FASTER MULTI-PLATFORM DEPLOYMENT

Abstract

The emergence of CNNs in mainstream deployment has necessitated methods to design and train efficient architectures tailored to maximize accuracy under diverse hardware and latency constraints. To scale these resource-intensive tasks to an increasing number of deployment targets, Once-For-All (OFA) proposed an approach to jointly train several models at once with a constant training cost. However, this cost remains as high as 40-50 GPU days, and the approach suffers from a combinatorial explosion of sub-optimal model configurations. We seek to reduce this search space, and hence the training budget, by constraining the search to models close to the accuracy-latency Pareto frontier. We incorporate insights about compound relationships between model dimensions to build CompOFA, a design space smaller by several orders of magnitude. Through experiments on ImageNet, we demonstrate that even with simple heuristics we can achieve a 2x reduction in training time [1] and a 216x speedup in model search/extraction time compared to the state of the art, without loss of Pareto optimality. We also show that this smaller design space is dense enough to support equally accurate models for a similar diversity of hardware and latency targets, while also reducing the complexity of the training and subsequent extraction algorithms. [2]

1. INTRODUCTION

CNNs are emerging in mainstream deployment across diverse hardware platforms, latency requirements, and workload characteristics. The available processing power, memory, and latency requirements may vary vastly across deployment scenarios: from server-grade GPUs to low-power embedded devices, from cycles of high workload to cycles of low workload. Since model accuracies tend to increase with computational budget, it becomes vital to build models tailored to each deployment scenario, maximizing accuracy under a constraint on model inference latency. These efficient models lie close to the Pareto frontier of the accuracy-latency tradeoff. Building such models (either manually or by search) and then training them are resource-intensive tasks, requiring massive computational resources, expertise in both ML and the underlying systems, time, dollar cost, and CO2 emissions every time they are performed. Repeating such intensive processes for each deployment target is prohibitively expensive with respect to multiple metrics of cost, and it does not scale.

Once-For-All (OFA) (Cai et al., 2020) proposed to address this challenge by decoupling the search and training phases through a novel progressive shrinking algorithm. OFA builds a family of 10^19 models of varying depth, width, kernel size, and image resolution. These models are jointly trained in a single shot via sharing of their intersecting weights. Once trained, search techniques can extract specialized sub-networks that meet specific deployment targets, a task that can then be independently repeated on the same trained family. However, this massive search space leads to a training cost that remains prohibitively expensive. Though the cost can be amortized over a number of deployment targets, it is still significant, reaching 1200 GPU hours for OFA. The search space arises from training every possible model combination, and many
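To make the combinatorial explosion concrete, the following sketch counts sub-network configurations under assumed hyperparameter choices modeled on OFA's MobileNetV3-style space (5 units; per-unit depth in {2, 3, 4}; per-layer kernel size in {3, 5, 7} and expansion ratio in {3, 4, 6}; elastic input resolution is ignored here for simplicity). The specific option sets and the compound-coupling rule are illustrative assumptions, not the exact spaces defined in the papers:

```python
# Assumed option sets, loosely following OFA's elastic dimensions.
DEPTHS = (2, 3, 4)     # layers per unit
KERNELS = (3, 5, 7)    # kernel sizes per layer
EXPANDS = (3, 4, 6)    # width expansion ratios per layer
UNITS = 5              # number of units in the network

# Each layer independently picks (kernel, expand): 3 * 3 = 9 choices.
# A unit of depth d therefore has 9**d configurations; sum over depths.
per_unit = sum((len(KERNELS) * len(EXPANDS)) ** d for d in DEPTHS)

# Units are configured independently, so the full space is per_unit**UNITS.
ofa_space = per_unit ** UNITS
print(f"independent space: ~{ofa_space:.2e}")  # on the order of 10^19

# If, as CompOFA proposes, width is coupled to depth via a compound rule,
# each unit contributes only one choice per depth setting:
compound_space = len(DEPTHS) ** UNITS
print(f"compound space: {compound_space}")
```

Even this rough count shows why coupling dimensions shrinks the space by many orders of magnitude: per-layer independence multiplies options exponentially in depth, while a compound rule keeps only one representative per depth level.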



[1] and, therefore, dollar cost and CO2 emissions.
[2] Our source code is available at https://github.com/gatech-sysml/CompOFA

