COMPOFA: COMPOUND ONCE-FOR-ALL NETWORKS FOR FASTER MULTI-PLATFORM DEPLOYMENT

Abstract

The emergence of CNNs in mainstream deployment has necessitated methods to design and train efficient architectures that maximize accuracy under diverse hardware and latency constraints. To scale these resource-intensive tasks to an increasing number of deployment targets, Once-For-All (OFA) proposed an approach to jointly train several models at once with a constant training cost. However, this cost remains as high as 40-50 GPU days, and the approach suffers from a combinatorial explosion of sub-optimal model configurations. We seek to reduce this search space, and hence the training budget, by constraining the search to models close to the accuracy-latency Pareto frontier. We incorporate insights of compound relationships between model dimensions to build CompOFA, a design space smaller by several orders of magnitude. Through experiments on ImageNet, we demonstrate that even with simple heuristics we can achieve a 2x reduction in training time and a 216x speedup in model search/extraction time compared to the state of the art, without loss of Pareto optimality. We also show that this smaller design space is dense enough to support equally accurate models for a similar diversity of hardware and latency targets, while also reducing the complexity of the training and subsequent extraction algorithms.

1. INTRODUCTION

CNNs are emerging in mainstream deployment across diverse hardware platforms, latency requirements, and workload characteristics. The available processing power, memory, and latency requirements can vary vastly across deployment platforms: from server-grade GPUs to low-power embedded devices, and across cycles of high or low workload. Since model accuracies tend to increase with computational budget, it becomes vital to build models tailored to each deployment scenario, maximizing accuracy subject to the desired model inference latency. These efficient models lie close to the Pareto frontier of the accuracy-latency tradeoff. Building such models (either manually or by searching) and then training them are resource-intensive tasks, requiring massive computational resources, expertise in both ML and the underlying systems, time, dollar cost, and CO2 emissions every time they are performed. Repeating such intensive processes for each deployment target is prohibitively expensive with respect to multiple metrics of cost, and it does not scale.

Once-For-All (OFA) (Cai et al., 2020) proposed to address this challenge by decoupling the search and training phases through a novel progressive shrinking algorithm. OFA builds a family of 10^19 models of varying depth, width, kernel size, and image resolution. These models are jointly trained in a single shot via sharing of their intersecting weights. Once trained, search techniques can extract specialized sub-networks that meet specific deployment targets, a task that can then be independently repeated on the same trained family. This massive search space leads to a training cost that remains prohibitively expensive. Though the cost can be amortized over a number of deployment targets, it is still significant, reaching 1200 GPU hours for OFA. The search space arises from training every possible model combination, and many of these combinations are sub-optimal.
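The deployment goal above can be stated concretely: among a pool of candidate models, keep only those on the accuracy-latency Pareto frontier, i.e., models for which no alternative is both faster and more accurate. The sketch below, using hypothetical (latency, accuracy) pairs rather than any measured data, illustrates this selection:

```python
# Sketch with hypothetical candidate models: select the accuracy-latency
# Pareto frontier, where lower latency and higher accuracy are better.
def pareto_frontier(models):
    """models: list of (latency_ms, top1_accuracy) tuples."""
    # Sort by latency ascending; break latency ties by higher accuracy
    # so an equal-latency, lower-accuracy model is correctly dominated.
    frontier = []
    best_acc = float("-inf")
    for lat, acc in sorted(models, key=lambda m: (m[0], -m[1])):
        if acc > best_acc:          # strictly improves on all faster models
            frontier.append((lat, acc))
            best_acc = acc
    return frontier

# Illustrative numbers only (not results from the paper):
candidates = [(15, 74.1), (22, 76.0), (22, 75.2), (30, 76.0), (35, 77.3)]
print(pareto_frontier(candidates))  # -> [(15, 74.1), (22, 76.0), (35, 77.3)]
```

Here (22, 75.2) and (30, 76.0) are dropped because some other candidate matches or beats them on both axes; exhaustive spaces like OFA's contain vast numbers of such dominated configurations.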
This exhaustive approach misses opportunities for any accuracy- or latency-guided exploration in such a vast space, and thus suffers a clear inefficiency. These sub-optimal models not only go unutilized but also add training interference, which necessitates a longer, phased training schedule to stabilize their simultaneous optimization. Finally, searching for and extracting an optimal model from this space can only be done via indirect estimators of accuracy and latency, as all model combinations cannot be enumerated.

On the other hand, we argue that such a large search space is unnecessary for two reasons. First, common practices as well as empirical studies (Tan & Le, 2019; Radosavovic et al., 2020) have shown that model dimensions such as depth, width, and resolution are not orthogonal: models that follow compound couplings between these dimensions produce a better accuracy-latency trade-off than those with unconstrained settings. Informally, an increase in model capacity along one dimension (say, depth) is helped by an accompanying increase along another dimension (say, width). Second, a much coarser latency granularity (on the order of 1 ms) is sufficient for practical systems deployment.

In this work we propose CompOFA, a model design space leveraging compound couplings between model dimensions, and demonstrate the following:

1. Utilizing the insight of compound coupling, we show that simple, easy-to-implement heuristics can capture models close to the Pareto frontier (depicted in Figure 1(c)). This enables us to reduce OFA's 10^19 models to just 243 in CompOFA, and still train a model family with an equally good accuracy-latency tradeoff.

2. We show that this tractable design space directly reduces interference in training, which allows us to reduce training duration and cost by 2x.

3. Once trained, CompOFA's simplicity lends itself to an easier extraction process that is 216x faster.

4. Despite the size reduction, we show that the latency granularity is sufficient to cover the same range and diversity of hardware targets as OFA.

5. Finally, the generality of CompOFA's insights is validated by training CompOFA on another base architecture, achieving similar gains.
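The scale of the reduction from 10^19 to 243 configurations follows from simple counting. The sketch below assumes an OFA-style supernet of 5 elastic units with per-unit depth in {2, 3, 4}, per-layer kernel size in {3, 5, 7}, and per-layer width expansion ratio in {3, 4, 6} (these particular values are assumptions for illustration); coupling width to depth into one compound choice per unit collapses each unit to 3 options:

```python
# Sketch (assumed dimension values): comparing an unconstrained OFA-style
# search space against a compound-coupled one.
UNITS = 5
DEPTHS = [2, 3, 4]      # layers per unit
KERNELS = [3, 5, 7]     # per-layer kernel sizes
WIDTHS = [3, 4, 6]      # per-layer width expansion ratios

# Unconstrained: each of the d layers in a unit independently picks a
# (kernel, width) pair, so a unit has sum over d of (3*3)^d options.
per_unit_free = sum((len(KERNELS) * len(WIDTHS)) ** d for d in DEPTHS)
free_total = per_unit_free ** UNITS        # on the order of 10^19

# Compound coupling: width is tied to depth (deeper unit -> wider layers),
# so each unit makes a single compound choice among len(DEPTHS) options.
coupled_total = len(DEPTHS) ** UNITS       # 3^5 = 243

print(f"unconstrained: ~{free_total:.1e}, coupled: {coupled_total}")
```

The unconstrained count here ignores the image-resolution dimension, so it is a lower bound on the full OFA space; the point is only the order-of-magnitude gap that compound coupling closes.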

2. RELATED WORK

Efficient neural network design has been an active area of research due to the high computational complexity of CNNs. NAS is increasingly used to guide or replace previously manual design processes.



Our source code is available at https://github.com/gatech-sysml/CompOFA



Figure 1: (a): Conventional methods require expensive designing & training per deployment platform, which is infeasible to scale. (b): OFA co-trains a family of subnetworks of a teacher supernet. However, combinatorial explosion of depth (D) and width (W) compels progressive, phased training requiring 1200 GPU hours. (c): CompOFA exploits the insight of compound couplings between D & W to vastly simplify the search space while maintaining Pareto optimality. The smaller space can be trained in half the time without phases, and gives equally performant and diverse model families.

