OTOV2: AUTOMATIC, GENERIC, USER-FRIENDLY

Abstract

Existing model compression methods based on structured pruning typically require complicated multi-stage procedures. Each individual stage necessitates numerous engineering efforts and domain knowledge from the end-users, which prevents their wider application to broader scenarios. We propose the second generation of Only-Train-Once (OTOv2), which, for the first time, automatically trains and compresses a general DNN only once from scratch to produce a more compact model with competitive performance, without fine-tuning. OTOv2 is automatic and pluggable into various deep learning applications, and requires only minimal engineering effort from the users. Methodologically, OTOv2 proposes two major improvements: (i) Autonomy: it automatically exploits the dependencies of general DNNs, partitions the trainable variables into Zero-Invariant Groups (ZIGs), and constructs the compressed model; and (ii) Dual Half-Space Projected Gradient (DHSPG): a novel optimizer to more reliably solve structured-sparsity problems. Numerically, we demonstrate the generality and autonomy of OTOv2 on a variety of model architectures such as VGG, ResNet, CARN, ConvNeXt, DenseNet and StackedUnets, the majority of which cannot be handled by other methods without extensive handcrafting effort. Together with benchmark datasets including CIFAR10/100, DIV2K, Fashion-MNIST, SVHN and ImageNet, its effectiveness is validated by performance that is competitive with or even better than the state of the art. The source code is available at https://github.com/tianyic/only_train_once.

1. INTRODUCTION

Large-scale Deep Neural Networks (DNNs) have demonstrated success in a variety of applications (He et al., 2016). However, deploying such heavy DNNs in resource-constrained environments poses severe challenges. Consequently, in both academia and industry, compressing full DNNs into slimmer ones with negligible performance regression has become popular. Although this area has been explored over the past decade, it is still far from fully solved. Weight pruning is perhaps the most popular compression method because of its generality and its ability to achieve significant reductions in FLOPs and model size by identifying and removing redundant structures (Gale et al., 2019; Han et al., 2015; Lin et al., 2019). However, most existing pruning methods follow a complicated multi-stage procedure as shown in Figure 1, which has apparent limitations: (i) Hand-crafting and user-hardness: they require significant engineering effort and expertise from users to apply the methods to their own scenarios; (ii) Expensiveness: they conduct DNN training multiple times, including the initial pre-training, the intermediate training for identifying redundancy, and the subsequent fine-tuning; and (iii) Low generality: many methods are designed for specific architectures and tasks and need additional effort to be extended to others.

To address these drawbacks, we naturally need a DNN training and pruning method that achieves the following goal: given a general DNN, automatically train it only once to achieve both high performance and a slimmer model architecture simultaneously, without pre-training or fine-tuning. To realize this goal, the following key problems need to be resolved systematically: (i) What are the removal structures (see Section 3.1 for a formal definition) of DNNs? (ii) How to identify the redundant removal structures? (iii) How to effectively remove redundant structures without deteriorating the model performance, so as to avoid extra fine-tuning? (iv) How to make all of the above proceed automatically? Addressing them is challenging in terms of both algorithmic design and engineering development, and has therefore, to our knowledge, not yet been achieved by existing methods.

To resolve (i-iii), Only-Train-Once (OTOv1) (Chen et al., 2021b) proposed the concept of Zero-Invariant Groups (ZIGs), a class of minimal removal structures that can be safely removed without affecting the network output if their parameters are zero. To jointly identify redundant ZIGs and achieve satisfactory performance, OTOv1 further proposed a Half-Space Projected Gradient (HSPG) method to compute a solution with both high performance and group sparsity over ZIGs, wherein zero groups correspond to redundant removal structures. As a result, OTOv1 trains a full DNN from scratch only once to compute a slimmer counterpart exhibiting competitive performance without fine-tuning, and is perhaps the closest to the above goal among existing competitors.
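To make the zero-invariant property concrete, the following is a minimal PyTorch sketch of our own (it is not part of the OTO library; the layer sizes and the channel index are arbitrary choices). The ZIG of one output channel in a Conv-BN pair consists of that channel's convolution filter and bias together with its batch-normalization weight and bias; once the whole group is zero, physically removing the channel, along with the corresponding input slice of the next convolution, leaves the network output unchanged.

import torch
import torch.nn as nn

torch.manual_seed(0)
k = 2  # channel whose ZIG we zero out and then remove

full = nn.Sequential(
    nn.Conv2d(3, 8, 3, padding=1), nn.BatchNorm2d(8), nn.ReLU(),
    nn.Conv2d(8, 4, 3, padding=1),
).eval()

# Zero the ZIG of channel k: conv filter, conv bias, BN weight (gamma), BN bias (beta).
with torch.no_grad():
    full[0].weight[k].zero_(); full[0].bias[k].zero_()
    full[1].weight[k].zero_(); full[1].bias[k].zero_()

# Build the compressed model: drop channel k from the first conv and the BN,
# and drop the corresponding input slice of the second conv.
keep = [i for i in range(8) if i != k]
pruned = nn.Sequential(
    nn.Conv2d(3, 7, 3, padding=1), nn.BatchNorm2d(7), nn.ReLU(),
    nn.Conv2d(7, 4, 3, padding=1),
).eval()
with torch.no_grad():
    pruned[0].weight.copy_(full[0].weight[keep]); pruned[0].bias.copy_(full[0].bias[keep])
    pruned[1].weight.copy_(full[1].weight[keep]); pruned[1].bias.copy_(full[1].bias[keep])
    pruned[1].running_mean.copy_(full[1].running_mean[keep])
    pruned[1].running_var.copy_(full[1].running_var[keep])
    pruned[3].weight.copy_(full[3].weight[:, keep]); pruned[3].bias.copy_(full[3].bias)

x = torch.randn(1, 3, 16, 16)
print(torch.allclose(full(x), pruned(x), atol=1e-6))  # True: outputs match

It is exactly this property that allows a group-sparse solution to be turned into a compressed model directly, without any fine-tuning.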
Nevertheless, the fundamental problem (iv) is not addressed in OTOv1, i.e., the ZIG partition is not automated and is only implemented for several specific architectures. OTOv1 thus suffers from requiring extensive hand-crafting effort and domain knowledge to partition trainable variables into ZIGs, which prohibits its broader usage. Meanwhile, OTOv1 depends heavily on HSPG to yield a solution with both satisfactory performance and high group sparsity. However, the sparsity exploration of HSPG is typically sensitive to the regularization coefficient, thereby requiring time-consuming hyper-parameter tuning, and it lacks the capacity to precisely control the ultimate sparsity level.

To overcome the drawbacks of OTOv1 and simultaneously tackle (i-iv), we propose Only-Train-Once v2 (OTOv2), the next-generation one-shot deep neural network training and pruning framework. Given a full DNN, OTOv2 is able to train and compress it from scratch into a slimmer DNN with significant reductions in FLOPs and parameter count. In contrast to others, OTOv2 drastically simplifies the complicated multi-stage procedures; guarantees performance more reliably than OTOv1; and is generic, automatic and user-friendly. Our main contributions are summarized as follows.


• Infrastructure for Automated DNN One-Shot Training and Compression. We propose and develop perhaps the first generic and automated framework to compress a general DNN with both excellent performance and substantial complexity reduction in terms of FLOPs and model cardinality. OTOv2 trains the DNN only once; neither pre-training nor fine-tuning is necessary. OTOv2 is user-friendly and easily applied to generic tasks, as shown in the library usage listing below. Its success relies on breakthroughs in both algorithmic design and infrastructure development.

• Automated ZIG Partition and Automated Compressed Model Construction. We propose a novel graph algorithm to automatically exploit the dependencies of a general DNN and partition its trainable variables into Zero-Invariant Groups (ZIGs), i.e., the minimal groups of parameters that need to be pruned together. We further propose a novel algorithm to automatically construct the compressed model by traversing the hierarchy of the DNN and eliminating the structures corresponding to ZIGs identified as zero. Both algorithms are carefully designed and work effectively with low time and space complexity.

• Novel Structured-Sparsity Optimization Algorithm. We propose a novel optimization algorithm, called Dual Half-Space Projected Gradient (DHSPG), to train a general DNN once from scratch and effectively achieve both competitive performance and high group sparsity over ZIGs, whose solution is then leveraged by the automated compression above. DHSPG formulates a constrained sparse optimization problem and solves it by constructing a search direction within the intersection of dual half-spaces to largely ensure progress toward both objective convergence and the identification of redundant groups (a simplified sketch of a half-space projection step follows this list). DHSPG outperforms the HSPG of OTOv1 in terms of a larger search space, less hyper-parameter tuning, and more reliable sparsity control.
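Because DHSPG is described here only at a high level, the snippet below is merely a rough Python sketch, under our own simplifying assumptions, of the half-space-style projection over ZIGs that HSPG-type optimizers build on: a group is set exactly to zero only when its trial iterate leaves a half-space defined by the current iterate, rather than by naive magnitude thresholding. The function name half_space_step, the plain SGD trial step, and the values of lr and eps are illustrative; the actual DHSPG additionally formulates a constrained problem to control the target group sparsity and combines gradient and regularization information when forming the search direction.

import torch

def half_space_step(params_by_group, lr=0.1, eps=0.05):
    # params_by_group: list of 1-D tensors, one per ZIG, with .grad populated
    # by a preceding backward pass.
    for x in params_by_group:
        with torch.no_grad():
            trial = x - lr * x.grad          # plain SGD trial point (illustrative)
            # Half-space test: keep the group only if the trial point stays in the
            # half-space {z : z^T x > eps * ||x||^2} defined by the current iterate.
            if torch.dot(trial, x) > eps * x.norm() ** 2:
                x.copy_(trial)               # ordinary update
            else:
                x.zero_()                    # project the whole group to zero

# Toy usage with two groups: a linear penalty pushes g2 across zero, so the
# projector zeroes the whole group instead of letting it change sign.
g1 = torch.tensor([0.9, -0.8], requires_grad=True)
g2 = torch.tensor([0.01, 0.02], requires_grad=True)
loss = (g1 - 1.0).pow(2).sum() + 0.5 * g2.sum()
loss.backward()
half_space_step([g1, g2])
print(g1.data, g2.data)  # g1 is updated normally; g2 becomes [0., 0.]

In this toy run, g1 receives an ordinary update while g2 is projected exactly to zero, mimicking how redundant ZIGs are driven to zero during training so that the corresponding structures can later be removed.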



Figure 1: OTOv2 versus existing methods.

Library Usage

from only_train_once import OTO
# General DNN model
oto = OTO(model)
optimizer = oto.dhspg()
# Train as normal
optimizer.step()
oto.compress()

Funding

† Partially supported by NSF grant CCF-2240708.

