OTOV2: AUTOMATIC, GENERIC, USER-FRIENDLY

Abstract

The existing model compression methods via structured pruning typically require complicated multi-stage procedures. Each individual stage necessitates numerous engineering efforts and domain knowledge from the end-users, which prevents these methods from being applied to broader scenarios. We propose the second generation of Only-Train-Once (OTOv2), which automatically trains and compresses a general DNN only once from scratch to produce a more compact model with competitive performance, without fine-tuning. OTOv2 is automatic, pluggable into various deep learning applications, and requires minimal engineering effort from users. Methodologically, OTOv2 proposes two major improvements: (i) Autonomy: it automatically exploits the dependencies of general DNNs, partitions the trainable variables into Zero-Invariant Groups (ZIGs), and constructs the compressed model; and (ii) Dual Half-Space Projected Gradient (DHSPG): a novel optimizer that solves structured-sparsity problems more reliably. Numerically, we demonstrate the generality and autonomy of OTOv2 on a variety of model architectures such as VGG, ResNet, CARN, ConvNeXt, DenseNet and StackedUnets, most of which cannot be handled by other methods without extensive handcrafting effort. On benchmark datasets including CIFAR10/100, DIV2K, Fashion-MNIST, SVHN and ImageNet, OTOv2 performs competitively with, or even better than, the state of the art. The source code is available at https://github.com/tianyic/only_train_once.

1. INTRODUCTION

Large-scale Deep Neural Networks (DNNs) have demonstrated success in a variety of applications (He et al., 2016). However, deploying such heavy DNNs in resource-constrained environments faces severe challenges. Consequently, in both academia and industry, compressing full DNNs into slimmer ones with negligible performance regression has become popular. Although this area has been explored over the past decade, it is still far from fully solved. Weight pruning is perhaps the most popular compression method because of its generality and its ability to achieve significant reductions in FLOPs and model size by identifying and removing redundant structures (Gale et al., 2019; Han et al., 2015; Lin et al., 2019). However, most existing pruning methods follow a complicated multi-stage procedure as shown in Figure 1, which has apparent limitations: (i) Hand-Crafting and User-Hardness: they require significant engineering effort and expertise from users to apply the methods to their own scenarios; (ii) Expensiveness: they conduct DNN training multiple times, including the initial pre-training, the intermediate training for identifying redundancy, and the subsequent fine-tuning; and (iii) Low Generality: many methods are designed for specific architectures and tasks and need additional effort to be extended to others. To address these drawbacks, we naturally need a DNN training and pruning method that achieves the following Goal: given a general DNN, automatically train it only once to achieve both high performance and a slimmer model architecture simultaneously, without pre-training or fine-tuning. To realize this goal, the following key problems need to be resolved systematically. (i) What are the removal structures (see Section 3.1 for a formal definition) of DNNs? (ii) How to identify the redundant removal structures? (iii) How to effectively remove redundant structures without deteriorating the model performance, so as to avoid extra fine-tuning?
(iv) How to make all of the above proceed automatically? Addressing these is challenging in terms of both algorithmic design and engineering development, and has therefore not yet been achieved by existing methods to our knowledge. To resolve (i-iii), Only-Train-Once (OTOv1) (Chen et al., 2021b) proposed the concept of Zero-Invariant Groups (ZIGs), a class of minimal removal structures that can be safely removed without affecting the network output if their parameters are zero. To jointly identify redundant ZIGs and achieve satisfactory performance, OTOv1 further proposed a Half-Space Projected Gradient (HSPG) method to compute a solution with both high performance and group sparsity over ZIGs, wherein zero groups correspond to redundant removal structures. As a result, OTOv1 trains a full DNN from scratch only once to compute a slimmer counterpart with competitive performance and without fine-tuning, and is perhaps the closest to the goal among existing competitors. Nevertheless, the fundamental problem (iv) is not addressed in OTOv1: the ZIG partition is not automated and is only implemented for several specific architectures. OTOv1 requires extensive hand-crafting effort and domain knowledge to partition trainable variables into ZIGs, which prohibits its broader usage. Meanwhile, OTOv1 depends heavily on HSPG to yield a solution with both satisfactory performance and high group sparsity. However, the sparsity exploration of HSPG is typically sensitive to the regularization coefficient, thereby requiring time-consuming hyper-parameter tuning, and it lacks the capacity to precisely control the ultimate sparsity level. To overcome the drawbacks of OTOv1 and simultaneously tackle (i-iv), we propose Only-Train-Once v2 (OTOv2), the next-generation one-shot deep neural network training and pruning framework.
Given a full DNN, OTOv2 is able to train and compress it from scratch into a slimmer DNN with significant FLOPs and parameter reductions. In contrast to others, OTOv2 drastically simplifies the complicated multi-stage procedures, guarantees performance more reliably than OTOv1, and is generic, automatic and user-friendly. Our main contributions are summarized as follows.
• Infrastructure for Automated DNN One-Shot Training and Compression. We propose and develop perhaps the first generic and automated framework to compress a general DNN with both excellent performance and substantial complexity reduction in terms of FLOPs and model cardinality. OTOv2 trains the DNN only once; neither pre-training nor fine-tuning is necessary. OTOv2 is user-friendly and easily applied to generic tasks, as shown in the library usage. Its success relies on breakthroughs in both algorithmic design and infrastructure development.
• Automated ZIG Partition and Automated Compressed Model Construction. We propose a novel graph algorithm to automatically exploit and partition the variables of a general DNN into Zero-Invariant Groups (ZIGs), i.e., the minimal groups of parameters that need to be pruned together. We further propose a novel algorithm to automatically construct the compressed model by traversing the hierarchy of the DNN and eliminating the structures corresponding to zero ZIGs. Both algorithms are carefully designed and work effectively with low time and space complexity.
• Novel Structured-Sparsity Optimization Algorithm. We propose a novel optimization algorithm, called Dual Half-Space Projected Gradient (DHSPG), to train a general DNN once from scratch and effectively achieve competitive performance and high group sparsity in the manner of ZIGs, whose solution is further leveraged in the automated compression above.
DHSPG formulates a constrained sparse optimization problem and solves it by constructing a direction within the intersection of dual half-spaces, which largely ensures progress towards both objective convergence and the identification of redundant groups. DHSPG outperforms the HSPG of OTOv1 in terms of a larger search space, less hyper-parameter tuning, and more reliable sparsity control.

3. OTOV2

OTOv2 nearly reaches the goal of model compression via weight pruning and is outlined in Algorithm 1. In general, given a neural network M to be trained and compressed, OTOv2 first automatically figures out the dependencies among the vertices to exploit the minimal removal structures and partitions the trainable variables into Zero-Invariant Groups (ZIGs) (Algorithm 2). The ZIGs G are then fed into a structured-sparsity optimization problem, which is solved by a Dual Half-Space Projected Gradient (DHSPG) method to yield a solution x*_DHSPG with competitive performance as well as high group sparsity in the view of ZIGs (Algorithm 3). The compressed model M* is ultimately constructed by removing the redundant structures corresponding to the ZIGs being zero. M* significantly accelerates inference in both time and space and returns outputs identical to the full model M parameterized by x*_DHSPG due to the properties of ZIGs, which avoids further fine-tuning of M*. The whole procedure proceeds automatically, is easily employed in various DNN applications, and requires minimal engineering effort from the users.

Algorithm 1 Outline of OTOv2.

1: Input: An arbitrary full model M to be trained and compressed (no need to be pretrained).
2: Construct ZIGs G by automated ZIG partition (Algorithm 2).
3: Train M via DHSPG (Algorithm 3) to yield a group-sparse solution x*_DHSPG.
4: Construct the compressed model M* upon x*_DHSPG (Section 3.3).
5: Output: the compressed model M*.

3.1. AUTOMATED ZIG PARTITION

Background. We review relevant concepts before describing how to perform ZIG partition automatically. Due to the complicated connectivity of DNNs, removing an arbitrary structure or component may result in an invalid DNN. We call a structure a removal structure if and only if the DNN without this component still serves as a valid DNN. Consequently, a removal structure is called minimal if and only if it cannot be further decomposed into multiple removal structures. A particular class of minimal removal structures, which produce zero outputs to the following layer if their parameters are zero, are called ZIGs (Chen et al., 2021b); they can be removed directly without affecting the network output. Thus, each ZIG consists of a minimal group of variables that need to be pruned together, and this notion covers most DNN structures, e.g., layers such as Conv, Linear and MultiHeadAtten. While ZIGs exist for general DNNs, their topology can vary significantly due to the complicated connectivity. Together with the lack of an API, this poses severe challenges to automatically exploiting ZIGs in terms of both algorithmic design and engineering development.

Algorithm 2 Automated Zero-Invariant Group Partition.
1: Input: A DNN M to be trained and compressed.
2: Construct the trace graph (E, V) of M.
3: Find connected components C over all accessory, shape-dependent joint and unknown vertices.
4: Grow C till incoming nodes are either stem or shape-independent joint vertices.
5: Merge connected components in C if any intersection.
6: Group pairwise parameters of stem vertices in the same connected component, associated with parameters from affiliated accessory vertices if any, as one ZIG into G.
7: Return the zero-invariant groups G.

Algorithmic Outline. To automate ZIG partition, we present a novel, effective and efficient algorithm.
As outlined in Algorithm 2, the algorithm essentially partitions the graph of the DNN into a set of connected components of dependency, then groups the variables based on the affiliations among the connected components. For a more intuitive illustration, we provide a small but complicated DemoNet along with explanations of its ground-truth minimal removal structures (ZIGs) in Figure 2. We now elaborate how Algorithm 2 automatically recovers the ground-truth ZIGs.

Graph Construction. In particular, we first establish the trace graph (E, V) of the target DNN, wherein each vertex in V refers to a specific operator, and the edges in E describe how they connect (line 2 of Algorithm 2). We categorize the vertices as stem, joint, accessory or unknown. Stem vertices are equipped with trainable parameters and have the capacity to transform their input tensors into other shapes, e.g., Conv and Linear. Joint vertices aggregate multiple input tensors into a single output, such as Add, Mul and Concat. Accessory vertices operate on a single input tensor to produce a single output and may possess trainable parameters, such as BatchNorm and ReLu. The remaining unknown vertices perform uncertain operations. Stem vertices hold most of the DNN parameters. Joint vertices establish the connections across different vertices, and thus largely account for the hierarchy and intricacy of a DNN. To keep the joint vertices valid, the minimal removal structures must be carefully constructed. Furthermore, we call a joint vertex input shape-dependent (SD) if it requires inputs of the same shape, such as Add, and shape-independent (SID) otherwise, such as Concat along the channel dimension of Conv outputs.

Construct Connected Components of Dependency. Next, we need to figure out the dependencies across the vertices to seek the minimal removal structures of the target DNN.
To proceed, we first connect adjacent accessory, SD joint and unknown vertices together to form a set of connected components C (see Figure 2c and line 3 of Algorithm 2). This step establishes the skeletons for finding vertices that depend on each other when considering removing hidden structures. The underlying intuitions are: (i) adjacent accessory vertices operate on, and are subject to, the same ancestral stem vertices if any; (ii) SD joint vertices force their ancestral stem vertices to depend on each other so as to yield tensors of the same shape; and (iii) unknown vertices introduce uncertainty, hence finding potentially affected vertices is necessary. We then grow C till all their incoming vertices are either stem or SID joint vertices, and merge the connected components if any intersect (lines 4-5). Remark here that the newly added stem vertices are affiliated with the accessory vertices, such as Conv1 for BN1-ReLu and Conv3+Conv2 for BN2|BN3 in Figure 2d. In addition, SID joint vertices introduce dependencies between their affiliated accessory vertices and incoming connected components; e.g., Concat-BN4 depends on both Conv1-BN1-ReLu and Conv3+Conv2-BN2|BN3 since BN4 normalizes their concatenated tensors along the channel dimension.

Form ZIGs. Finally, we form ZIGs based on the connected components of dependency as in Figure 2d. The pairwise trainable parameters across all individual stem vertices in the same connected component are first grouped together as in Figure 2e, wherein parameters of the same color represent one group. Afterwards, the accessory vertices insert their trainable parameters, if applicable, into the groups of their dependent stem vertices accordingly. Some accessory vertices such as BN4 may depend on multiple groups because of an SID joint vertex; in that case the trainable parameters γ_4 and β_4 are partitioned and separately added into the corresponding groups, e.g., γ_4^1, β_4^1 and γ_4^2, β_4^2.
In addition, the connected components adjacent to the output of the DNN are excluded from forming ZIGs, since the output shape should stay fixed, e.g., Linear2. For safety, the connected components that possess unknown vertices are excluded as well due to uncertainty, which further guarantees the generality of the framework when applied to DNNs with customized operators.

Complexity Analysis. The proposed automated ZIG partition (Algorithm 2) is a series of customized graph algorithms carefully composed together. Each individual sub-algorithm recursively traverses the trace graph of the DNN via depth-first search and performs step-specific operations, which has time complexity O(|V| + |E|) and space complexity O(|V|) in the worst case. The time bound follows from discovering all neighbors of each vertex by traversing the adjacency list once in linear time. The space bound holds because the trace graph of a DNN is acyclic, so the memory consumption is at most the length of the longest possible path, which is bounded by |V|. Therefore, automated ZIG partition completes in linear time.
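To make the grouping steps concrete, the following toy sketch (our own illustration, not the released implementation) mimics lines 3-5 of Algorithm 2 on a DemoNet-like fragment using union-find; the vertex names, categories, and data structures are hypothetical simplifications.

```python
from collections import defaultdict

# Toy trace graph: Conv1 -> BN1 -> ReLU -> Concat <- BN2 <- Conv2, Concat -> BN4.
edges = [("Conv1", "BN1"), ("BN1", "ReLU"), ("Conv2", "BN2"),
         ("ReLU", "Concat"), ("BN2", "Concat"), ("Concat", "BN4")]
category = {"Conv1": "stem", "Conv2": "stem", "BN1": "accessory",
            "BN2": "accessory", "ReLU": "accessory", "BN4": "accessory",
            "Concat": "joint_sid"}  # channel-wise Concat: shape-independent

# Union-find over vertices to build connected components of dependency.
parent = {v: v for v in category}
def find(v):
    while parent[v] != v:
        parent[v] = parent[parent[v]]
        v = parent[v]
    return v
def union(a, b):
    parent[find(a)] = find(b)

# Line 3: connect adjacent accessory / shape-dependent joint / unknown vertices.
# SID joints are skipped, since their inputs may be pruned independently.
dep = {"accessory", "joint_sd", "unknown"}
for a, b in edges:
    if category[a] in dep and category[b] in dep:
        union(a, b)

# Line 4: grow components to absorb their incoming (affiliated) stem vertices.
for a, b in edges:
    if category[a] == "stem" and category[b] in dep:
        union(a, b)

components = defaultdict(set)
for v in category:
    components[find(v)].add(v)
# Components: {Conv1, BN1, ReLU}, {Conv2, BN2}, {BN4}, {Concat}; BN4 sits alone
# behind the SID Concat, so line 6 later splits its parameters across the two
# incoming components, mirroring the γ_4/β_4 partition described above.
print(sorted(sorted(c) for c in components.values()))
```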

3.2. DUAL HALF-SPACE PROJECTED GRADIENT (DHSPG)

Given the ZIGs G constructed by Algorithm 2, the next step is to jointly identify which groups are redundant and can be removed, and to train the remaining groups to achieve high performance. To tackle this, we construct a structured-sparsity optimization problem and solve it via a novel DHSPG. Compared with HSPG, DHSPG constitutes a dual half-space direction with automatically selected regularization coefficients to more reliably control the sparsity exploration, and enlarges the search space by partitioning the ZIGs into separate sets to avoid being trapped around the origin, for better generalization.

Target Problem. A structured-sparsity-inducing optimization problem is a natural choice to seek a group-sparse solution with high performance, wherein the zero groups refer to the redundant structures, and the non-zero groups retain the prediction power to maintain performance competitive with the full model. We formulate an optimization problem with a group sparsity constraint in the form of the ZIGs G as

  minimize_{x ∈ R^n} f(x),  s.t.  Card{g ∈ G | [x]_g = 0} = K,    (1)

and propose a novel Dual Half-Space Projected Gradient (DHSPG) method to solve it. Here K is the target group sparsity level; larger K indicates higher group sparsity in the solution and typically results in more aggressive FLOPs and parameter reductions.

Related Optimizers and Limitations. To solve such a constrained problem, ADMM converts it into a min-max problem, but cannot tackle the non-smooth and non-convex hard sparsity constraint without hurting the objective, thus necessitating extra fine-tuning afterwards (Lin et al., 2019). HSPG in OTOv1 (Chen et al., 2021b) and proximal methods (Xiao & Zhang, 2014) relax it into an unconstrained mixed ℓ1/ℓp regularization problem, but cannot guarantee the sparsity constraint because of the implicit relationship between the regularization coefficient and the sparsity level.
In addition, the augmented regularizer penalizes the magnitude of all trainable variables, which restricts the search space and drives convergence to local optima near the origin, e.g., x*_1 in Figure 3. However, the local optima with the highest generalization may be located differently for different applications, and some may stay away from the origin, e.g., x*_2, ..., x*_5 in Figure 3.

Algorithm Outline for DHSPG. To resolve the drawbacks of the existing optimization algorithms for solving (1), we propose a novel algorithm, named Dual Half-Space Projected Gradient (DHSPG), stated as Algorithm 3, with two takeaways.

Algorithm 3 Dual Half-Space Projected Gradient (DHSPG)
1: Input: initial variable x_0 ∈ R^n,

Partition Groups. To avoid always being trapped in local optima near the origin, we further partition the groups in G into two subsets: G_p, whose variable magnitudes are penalized, and G_np, whose variable magnitudes are not forced to be penalized. Different criteria can be applied to construct this partition based on salience scores, e.g., the cosine similarity cos(θ_g) between the projection direction -[x]_g and the negative gradient -[∇f(x)]_g or its estimate. A higher cosine similarity for g ∈ G indicates that projecting the variables in g onto zero is more likely to make progress towards the optimality of f (it forms a descent direction from the optimization perspective), so the magnitude of [x]_g should be penalized. Therefore, we compute G_p by picking the ZIGs with the top-K highest salience scores, and G_np as its complement, as in (2). To compute more reliable scores, the partition is performed after T_w warm-up steps (lines 2-3):

  G_p = Top-K argmax_{g ∈ G} salience-score(g)  and  G_np = G \ G_p.    (2)

Update Variables.
For the variables in G_np, whose magnitudes are not penalized, we apply vanilla stochastic gradient descent or its variants such as Adam (Kingma & Ba, 2014), i.e., [x_{t+1}]_{G_np} ← [x_t]_{G_np} - α_t [∇f(x_t)]_{G_np}. For the groups of variables in G_p, whose magnitudes are penalized, we seek to identify redundant groups as zero. Instead of directly projecting them onto zero as in ADMM, which easily destroys the progress towards the optimum, we formulate a relaxed unconstrained subproblem (3) to gradually reduce the magnitudes without deteriorating the objective, and project groups onto zero only if the projection serves as a descent direction during training:

  minimize_{[x]_{G_p}} ψ([x]_{G_p}) := f([x]_{G_p}) + Σ_{g ∈ G_p} λ_g ∥[x]_g∥_2,    (3)

where λ_g is a group-specific regularization coefficient that must be carefully chosen to guarantee the decrease of both the variable magnitude for g and the objective f. In particular, we compute a negative subgradient of ψ as the search direction

  [d(x)]_g := -[∇f(x)]_g - λ_g [x]_g / max{∥[x]_g∥_2, τ}  for g ∈ G_p,

with τ a safeguard constant. To ensure that [d(x)]_{G_p} is a descent direction for both f and ∥x∥_2, [d(x)]_g needs to fall into the intersection of the dual half-spaces with normal directions -[∇f(x)]_g and -[x]_g for every g ∈ G_p, as shown in Figure 4; in other words, [d(x)]_{G_p}^⊤ [-∇f(x)]_{G_p} > 0 and [d(x)]_{G_p}^⊤ [-x]_{G_p} > 0. This further indicates that λ_g should lie in the interval (λ_min,g, λ_max,g) := (-cos(θ_g) ∥[∇f(x)]_g∥_2, -∥[∇f(x)]_g∥_2 / cos(θ_g)) if cos(θ_g) < 0, and can otherwise be an arbitrary positive constant. Such a λ_g brings a decrease of both the objective and the variable magnitude. We then compute a trial iterate [x̂_{t+1}]_{G_p} ← [x_t - α_t d(x_t)]_{G_p} via subgradient descent on ψ (line 8). The trial iterate is fed into the Half-Space projector (Chen et al., 2021b), which outperforms proximal operators in yielding group sparsity more productively without hurting the objective (lines 9-10).
Remark here that OTOv1 uses a global coefficient λ for all groups, and thus lacks sufficient capability to guarantee both aspects for each individual group.

Convergence and Complexity Analysis. DHSPG converges to a solution x*_DHSPG of (1) both in theory and in practice. The theoretical convergence relies on the construction of the dual half-space mechanism, which yields sufficient decrease of both the objective f and the variable magnitudes; see Lemma 2 and Corollary 1 in Appendix C. Together with the sparsity recovery of the Half-Space projector (Chen et al., 2020b, Theorem 2), DHSPG effectively computes a solution with the desired group sparsity. In addition, DHSPG has the same time complexity O(n) as other first-order methods such as SGD and Adam, since all operations can be finished in linear time.
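As a concrete illustration of the two takeaways, the sketch below (a simplification under our own naming; the released optimizer differs in its details) scores groups by the cosine criterion of Eq. (2) and then applies the dual half-space update followed by a half-space projection to a single penalized group. The projector threshold ε and the safeguard τ are illustrative constants.

```python
import numpy as np

def cos_theta(x_g, grad_g):
    # cos of the angle between the projection direction -x_g and -grad_g.
    return (x_g @ grad_g) / (np.linalg.norm(x_g) * np.linalg.norm(grad_g) + 1e-12)

def partition_groups(x, grad, groups, K):
    """Eq. (2): the top-K salient ZIGs become G_p (magnitude-penalized)."""
    scores = {g: cos_theta(x[idx], grad[idx]) for g, idx in groups.items()}
    ranked = sorted(scores, key=scores.get, reverse=True)
    return set(ranked[:K]), set(groups) - set(ranked[:K])

def dhspg_step_group(x_g, grad_g, alpha, lam, tau=1e-6, eps=0.0):
    """Dual half-space direction for one g in G_p, then half-space projection:
    the group is zeroed only if the trial iterate leaves the half-space
    {z : z^T x_g > eps * ||x_g||^2}, i.e. zeroing does not oppose the step."""
    d_g = -grad_g - lam * x_g / max(np.linalg.norm(x_g), tau)
    trial = x_g + alpha * d_g
    if trial @ x_g <= eps * np.linalg.norm(x_g) ** 2:
        return np.zeros_like(x_g)
    return trial

x = np.array([0.01, -0.02, 1.0, 2.0])       # group "a" is nearly zero already
grad = np.array([0.01, -0.02, -0.5, 0.3])
groups = {"a": [0, 1], "b": [2, 3]}
G_p, G_np = partition_groups(x, grad, groups, K=1)
print(G_p, G_np)                             # "a" is picked for penalization

idx = groups[next(iter(G_p))]
new_xa = dhspg_step_group(x[idx], grad[idx], alpha=1.0, lam=1e-3)
print(new_xa)                                # projected onto zero
```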

3.3. AUTOMATED COMPRESSED MODEL CONSTRUCTION

Given the solution x*_DHSPG with both high performance and group sparsity, we now automatically construct a compact model, which was a manual step requiring substantial engineering effort in OTOv1. Consider the example in Figure 5 with G = {g_1, g_2, ..., g_5} and [x*_DHSPG]_{g_2∪g_3∪g_4} = 0. In general, we traverse all vertices with trainable parameters, then remove the structures corresponding to the ZIGs being zero, such as the dotted rows of K_1, K_2, K_3 and the scalars b_2, γ_1, β_1 illustrated in Figure 5. Next, we erase the redundant parameters that affiliate with the removed structures of their incoming stem vertices to keep the operations valid, e.g., the second and third channels in g_5 are removed even though g_5 is not zero. The automated algorithm completes promptly, in linear time, via two passes of depth-first search and parameter manipulation, producing a more compact model M*. By the property of ZIGs, M* returns the same inference outputs as the full model M parameterized by x*_DHSPG, so no further fine-tuning is necessary.

4. NUMERICAL EXPERIMENTS

We first demonstrate the autonomy and correctness of OTOv2 on DNNs with complicated structures (see the visualizations in Appendix D). Then, we compare OTOv2 with other methods on benchmark experiments to show its competitive (or even superior) performance. In addition, we conduct ablation studies of DHSPG versus HSPG on the popular super-resolution task, and on BERT (Devlin et al., 2019) on Squad (Rajpurkar et al., 2016), in Appendix B. Together with its autonomy, user-friendliness and generality, OTOv2 arguably becomes the new state of the art.

Sanity of Automated ZIG and Automated Compression. The foremost step is to validate the correctness of the whole framework, including both the algorithm designs and the infrastructure developments.
We select five DNNs with complex topological structures, i.e., StackedUnets, DenseNet (Huang et al., 2017), ConvNeXt (Liu et al., 2022) and CARN (Ahn et al., 2018) (see Appendix B for details), as well as DemoNet from Section 3.1, none of which can be easily compressed by existing non-automatic methods without sufficient domain knowledge and extensive handcrafting effort. Remark here that StackedUnets consumes two input tensors and is constructed by stacking two standard Unets (Ronneberger et al., 2015) with different downsamplers and aggregating the two corresponding outputs. To illustrate the automated ZIG partition over these complicated structures intuitively, we provide visualizations of the connected components of dependency in Appendix D. To quantitatively measure the performance of OTOv2, we further employ these model architectures on a variety of benchmark datasets, e.g., Fashion-MNIST (Xiao et al., 2017), SVHN (Netzer et al., 2011), CIFAR10/100 (Krizhevsky & Hinton, 2009) and ImageNet (Deng et al., 2009). The main results are presented in Table 1. ResRep (Ding et al., 2021), Group-HS (Yang et al., 2019) and GBN-60 (You et al., 2019) achieve over 76% accuracy but consume more FLOPs than OTOv2 and are not automated for general DNNs.

5. CONCLUSION

We propose OTOv2, which automatically trains a general DNN only once and compresses it into a more compact counterpart without pre-training or fine-tuning, significantly reducing its FLOPs and parameter count. The success stems from two major improvements upon OTOv1: (i) automated ZIG partition and automated compressed model construction; and (ii) the DHSPG method to more reliably solve structured-sparsity problems. We leave the incorporation of NAS as future work.

A IMPLEMENTATION DETAILS

A.1 LIBRARY IMPLEMENTATION

The implementation of the current version of OTOv2 (February 2023) depends on PyTorch (Paszke et al., 2019) and ONNX (Bai et al., 2019), an open industrial standard for machine learning interoperability that is widely used in numerous AI products from top-tier companies (see the partners at https://onnx.ai/). In particular, the operators and the connectivity of a general DNN are retrieved by calling PyTorch's ONNX optimization API, which is the first step of establishing the trace graph of OTOv2 in Algorithm 2. The proposed DHSPG is implemented as an instance of the optimizer class in PyTorch. The final compact model construction of Section 3.3 is implemented by modifying the attributes and parameters of the vertices in ONNX models according to x*_DHSPG and the ZIGs. As a result, OTOv2 realizes an end-to-end pipeline to automatically and conveniently produce a compact model that meets inference restrictions and can be directly deployed to production environments. In addition, the constructed compact DNNs in ONNX format can be converted back into either torch or tensorflow formats if needed by open-source tools (Arseny, 2022).
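The channel-slicing arithmetic behind the compact-model construction of Section 3.3 can be sketched with plain NumPy; the array shapes and layer names below are a hypothetical illustration, not the library's actual ONNX manipulation code.

```python
import numpy as np

# Zero ZIGs decide which output channels of a layer are removed, and the *input*
# channels of the consuming layer are sliced to match, precisely because zeroed
# groups contribute nothing to the next layer's output.
rng = np.random.default_rng(0)
K1 = rng.standard_normal((4, 3, 3, 3))   # Conv1: 4 out-channels, 3 in-channels
K2 = rng.standard_normal((5, 4, 3, 3))   # Conv2 consumes Conv1's 4 channels
K1[[1, 2]] = 0.0                          # two ZIGs of Conv1 solved to zero

keep = [i for i in range(K1.shape[0]) if np.any(K1[i] != 0)]
K1_slim = K1[keep]            # remove pruned output channels (rows of K1)
K2_slim = K2[:, keep]         # erase the affiliated redundant input channels
print(K1_slim.shape, K2_slim.shape)  # (2, 3, 3, 3) (5, 2, 3, 3)
```

The slimmed pair computes the same composition as the full pair on any input, since the dropped channels were identically zero.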

A.2 LIMITATIONS OF BETA VERSION

Dependency. The current version of OTOv2 (February 2023) depends on the ONNX optimization API in PyTorch to obtain the vertices (operations) and the connections among them, i.e., (E, V) in line 2 of Algorithm 2. This is the foremost step in establishing the trace graph of the DNN for automated ZIG partition. Therefore, DNNs that do not comply with this API are not yet supported by the beta version of OTOv2. We notice that Transformers sometimes have incompatibility issues with their position-embedding layers, whereas their trainable parts, such as the encoder layers, do not. This limitation is an engineering one and should be resolved by the active and rapid developments of the ONNX and PyTorch communities, driven by industry and academia.

Unknown Operators. For sanity, we exclude the connected components that possess uncertain/unknown vertices when forming ZIGs in Algorithm 2. This mechanism largely ensures the generality of the automated ZIG partition for general DNNs, but ignoring these connected components may miss some valid ZIGs and thereby leave some redundant structures unpruned. We will maintain and update the operator list, which currently consists of 31 (known/certain) operators, to better exploit ZIGs.

A.3 EXPERIMENTAL DETAILS

We conducted the experiments on an NVIDIA A100 GPU server. For the experiments in the main body, we estimated the gradient by sampling a mini-batch of data points with a first-order momentum coefficient of 0.9. The mini-batch sizes follow other related works, from {64, 128, 256}. All experiments in the main body share the same commonly used learning-rate scheduler, which starts from 10^-1 and periodically decays by a factor of 10 every T_period epochs until reaching 10^-4. The length of the decaying period T_period depends on the maximum epoch count, i.e., 120 for ImageNet and 300 for the others. In general, we follow the usage of the Half-Space projector in (Chen et al., 2021b) and trigger it when the learning rate is first decayed after T_period epochs. G_p and G_np are constructed after T_period/2 warm-up epochs, set empirically in our experiments. To compute the salience score, we jointly consider both the cosine similarity and the magnitude of each group g ∈ G. For the groups g ∈ G_p whose magnitudes need to be penalized, we set λ_g in Algorithm 3 as λ_g = Λ := 10^-3 if the regularization coefficient does not need to be adjusted, i.e., cos(θ_g) ≥ 0. Note that Λ := 10^-3 is a commonly used coefficient in the sparsity regularization literature (Chen et al., 2021b; Xiao & Zhang, 2014). Otherwise, we compute λ_min,g := -cos(θ_g) ∥[∇f(x)]_g∥_2 and λ_max,g := -∥[∇f(x)]_g∥_2 / cos(θ_g), and set λ_g by amplifying λ_min,g by a factor of 1.1, projecting back to λ_max,g if it exceeds that bound.
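The coefficient rule described above can be written compactly as follows; this is our own sketch of the stated formulas, with illustrative variable names.

```python
import numpy as np

def select_lambda(x_g, grad_g, Lambda=1e-3, amplify=1.1):
    """Pick lambda_g per the rule above: the default Lambda when cos(theta_g) >= 0,
    otherwise 1.1 * lambda_min_g, projected back to lambda_max_g if it exceeds."""
    g_norm = np.linalg.norm(grad_g)
    cos_t = (x_g @ grad_g) / (np.linalg.norm(x_g) * g_norm + 1e-12)
    if cos_t >= 0:
        return Lambda
    lam_min = -cos_t * g_norm           # lower end of the admissible interval
    lam_max = -g_norm / cos_t           # upper end
    return min(amplify * lam_min, lam_max)

# cos(theta_g) >= 0: keep the default coefficient.
print(select_lambda(np.array([1.0, 0.0]), np.array([2.0, 0.0])))   # 0.001
# cos(theta_g) < 0: amplified lambda_min, clipped by lambda_max.
print(select_lambda(np.array([1.0, 2.0]), np.array([0.1, -0.2])))
```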

B EXTENSIVE ABLATION STUDY

In this appendix, we present additional experiments to demonstrate the superiority of DHSPG over HSPG in finding local optima with higher performance. As described in the main body, the main advantages of DHSPG compared with HSPG in OTOv1 are (i) enlarging the search space to be capable of finding local optima with higher performance, if any, and (ii) more reliably guaranteeing the ultimate group sparsity level. The latter has been demonstrated by the ResNet50 experiments, where DHSPG precisely achieves different prescribed group sparsity levels to meet the requirements of various deployment environments. In contrast, HSPG lacks the capacity to achieve a specific group sparsity level due to the implicit relationship between the regularization coefficient λ and the sparsity level. The former is validated in this appendix. In depth, one takeaway of DHSPG is to separate the groups of variables and then treat them via different, specifically designed mechanisms, which greatly enlarges the search space. HSPG, by contrast, applies the same mechanism to update all variables, which may easily result in convergence towards the origin and may not be optimal. In addition, we also provide a runtime comparison in Appendix B.3.

B.1 SUPER RESOLUTION

We select the popular model architecture CARN (Ahn et al., 2018) for the super-resolution task with a scaling factor of two. As in (Oh et al., 2022), we use the benchmark DIV2K dataset (Agustsson & Timofte, 2017) for training, and the Set14 (Zeyde et al., 2010), B100 (Martin et al., 2001) and Urban100 (Huang et al., 2015) datasets for evaluation. Similar to the other experiments presented in the main body, OTOv2 automatically partitions the trainable variables of CARN into ZIGs (see Figure 8 in Appendix D). We then follow the training procedure in (Agustsson & Timofte, 2017) and apply the Adam strategy within DHSPG, i.e., utilizing both first- and second-order momentum to compute the gradient estimate in line 5 of Algorithm 3. Under the same learning-rate scheduler and total number of steps as the baseline, we run both DHSPG and HSPG to compute solutions with high group sparsity, where we set the target group sparsity to 50% for DHSPG, and for HSPG we tune the regularization coefficient λ over powers of 10 from 10^-3 to 10^3 and pick the one yielding significant FLOPs and parameter reductions with satisfactory performance. Finally, the more compact CARN models are constructed via the automated compressed model construction of Section 3.3. We report the final results in Table 4. Unlike the classification experiments, where HSPG and DHSPG perform quite competitively, OTOv2 with DHSPG significantly outperforms OTOv2 with HSPG on this super-resolution task, using 46% fewer FLOPs and parameters while achieving significantly better PSNR on these benchmark datasets. This is strong evidence of the higher generality of DHSPG, which enlarges the search space rather than restricting it near the origin, to fit more general applications.

B.2 BERT ON SQUAD

We next compare DHSPG versus HSPG on pruning the large-scale transformer Bert (Vaswani et al., 2017), evaluated on SQuAD, a question-answering benchmark (Rajpurkar et al., 2016). Remark that since Transformers are not reliably compatible with PyTorch's ONNX optimization API at this moment, they cannot yet enjoy the end-to-end autonomy of OTOv2. To compare the two optimizers, we apply DHSPG within the OTOv1 framework, which manually conducts the ZIG partition, and construct the compressed Bert without fine-tuning. The results are reported in Table 5. Based on Table 5, it is apparent that DHSPG performs significantly better than HSPG and ProxSSI (Deleu & Bengio, 2021), achieving 83.8%-87.7% F1-scores and 74.6%-80.0% exact-match rates. In contrast, HSPG and ProxSSI reach 82.0%-84.1% F1-scores and 71.9%-75.0% exact-match rates. The underlying reason for this remarkable improvement is that DHSPG enlarges the search space away from the origin by partitioning the groups into magnitude-penalized ones and others, and treating them separately. Both ProxSSI and HSPG instead penalize the magnitudes of all variables and apply the same update mechanism to all of them, which deteriorates the performance significantly in this experiment. The results validate the effectiveness of DHSPG's design of enlarging the search space, which typically yields better generalization performance.
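For reference, the two metrics quoted above can be sketched as follows. These are simplified versions of the official SQuAD evaluation, which additionally strips punctuation and articles before matching:

```python
from collections import Counter

def exact_match(pred, gold):
    """1 if the normalized answers are identical, else 0 (simplified)."""
    return int(pred.strip().lower() == gold.strip().lower())

def f1_score(pred, gold):
    """Token-overlap F1 between a predicted and a gold answer span."""
    p, g = pred.lower().split(), gold.lower().split()
    common = Counter(p) & Counter(g)          # multiset intersection
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(p), overlap / len(g)
    return 2 * precision * recall / (precision + recall)

print(f1_score("the cat sat", "the cat"))  # 0.8
print(exact_match("Denver", "denver"))     # 1
```

The reported scores are these quantities averaged over all questions, with the maximum taken over the available gold answers per question.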

C CONVERGENCE ANALYSIS

In this appendix, we provide a rough convergence analysis for the proposed DHSPG. This paper is application-track and mainly focuses on deep learning compression and infrastructure rather than theoretical convergence. Therefore, for simplicity, we assume a full gradient estimate at each iteration; a more rigorous analysis under stochastic settings is left as future work.

Lemma 1. The objective function $f$ satisfies
$$f(x+\alpha d(x)) \le f(x) - \Big(\alpha - \tfrac{L\alpha^2}{2}\Big)\|\nabla f(x)\|_2^2 + \tfrac{L\alpha^2}{2}\sum_{g\in\mathcal{G}_p}\lambda_g^2 + (L\alpha-1)\,\alpha\sum_{g\in\mathcal{G}_p}\lambda_g\cos(\theta_g)\,\|[\nabla f(x)]_g\|_2. \tag{4}$$

Proof. By the algorithm,
$$[d(x)]_g = \begin{cases} -[\nabla f(x)]_g - \lambda_g\,\dfrac{[x]_g}{\|[x]_g\|_2}, & g\in\mathcal{G}_p,\\[4pt] -[\nabla f(x)]_g, & \text{otherwise.}\end{cases} \tag{5}$$
For $g\in\mathcal{G}_p$, we rewrite the direction $[d(x)]_g$ as the sum of two parts, $[d(x)]_g = [\hat d(x)]_g + [\tilde d(x)]_g$, where $[\hat d(x)]_g^\top[\nabla f(x)]_g = 0$ and
$$\big\|[\hat d(x)]_g\big\|_2 = \Big\|\lambda_g\,\tfrac{[x]_g}{\|[x]_g\|_2}\Big\|_2\cos(\theta_g - 90^\circ) = \lambda_g\sin(\theta_g). \tag{7}$$
Consequently,
$$\begin{aligned}
\big\|[\tilde d(x)]_g\big\|_2^2 &= \big\|[d(x)]_g\big\|_2^2 - \big\|[\hat d(x)]_g\big\|_2^2 = \Big\|-[\nabla f(x)]_g - \lambda_g\tfrac{[x]_g}{\|[x]_g\|_2}\Big\|_2^2 - \lambda_g^2\sin^2(\theta_g)\\
&= \|[\nabla f(x)]_g\|_2^2 + \lambda_g^2 + 2\lambda_g[\nabla f(x)]_g^\top\tfrac{[x]_g}{\|[x]_g\|_2} - \lambda_g^2\sin^2(\theta_g)\\
&= \|[\nabla f(x)]_g\|_2^2 + \lambda_g^2\cos^2(\theta_g) + 2\lambda_g\|[\nabla f(x)]_g\|_2\cos(\theta_g) = \big[\|[\nabla f(x)]_g\|_2 + \lambda_g\cos(\theta_g)\big]^2, 
\end{aligned} \tag{8}$$
and therefore
$$[\tilde d(x)]_g = -\frac{\|[\nabla f(x)]_g\|_2 + \lambda_g\cos(\theta_g)}{\|[\nabla f(x)]_g\|_2}\,[\nabla f(x)]_g := -\omega_g[\nabla f(x)]_g. \tag{9}$$
By the descent lemma, and using $[\nabla f(x)]_{\mathcal{G}_p}^\top[\hat d(x)]_{\mathcal{G}_p} = 0$ together with $[\hat d(x)]_g^\top[\tilde d(x)]_g = 0$ (since $[\tilde d(x)]_g$ is parallel to $[\nabla f(x)]_g$), we have
$$\begin{aligned}
f(x+\alpha d(x)) &\le f(x) + \alpha\nabla f(x)^\top d(x) + \tfrac{L\alpha^2}{2}\|d(x)\|_2^2\\
&= f(x) + \alpha[\nabla f(x)]_{\mathcal{G}_{np}}^\top[d(x)]_{\mathcal{G}_{np}} + \alpha[\nabla f(x)]_{\mathcal{G}_p}^\top[d(x)]_{\mathcal{G}_p} + \tfrac{L\alpha^2}{2}\big\|[d(x)]_{\mathcal{G}_{np}}\big\|_2^2 + \tfrac{L\alpha^2}{2}\big\|[d(x)]_{\mathcal{G}_p}\big\|_2^2\\
&= f(x) - \Big(\alpha-\tfrac{L\alpha^2}{2}\Big)\big\|[\nabla f(x)]_{\mathcal{G}_{np}}\big\|_2^2 + \alpha[\nabla f(x)]_{\mathcal{G}_p}^\top[\tilde d(x)]_{\mathcal{G}_p} + \tfrac{L\alpha^2}{2}\big\|[\hat d(x)]_{\mathcal{G}_p}\big\|_2^2 + \tfrac{L\alpha^2}{2}\big\|[\tilde d(x)]_{\mathcal{G}_p}\big\|_2^2\\
&= f(x) - \Big(\alpha-\tfrac{L\alpha^2}{2}\Big)\big\|[\nabla f(x)]_{\mathcal{G}_{np}}\big\|_2^2 - \alpha\sum_{g\in\mathcal{G}_p}\big(\|[\nabla f(x)]_g\|_2+\lambda_g\cos(\theta_g)\big)\|[\nabla f(x)]_g\|_2\\
&\qquad + \tfrac{L\alpha^2}{2}\sum_{g\in\mathcal{G}_p}\lambda_g^2\sin^2(\theta_g) + \tfrac{L\alpha^2}{2}\sum_{g\in\mathcal{G}_p}\big(\|[\nabla f(x)]_g\|_2+\lambda_g\cos(\theta_g)\big)^2\\
&= f(x) - \Big(\alpha-\tfrac{L\alpha^2}{2}\Big)\|\nabla f(x)\|_2^2 + \tfrac{L\alpha^2}{2}\sum_{g\in\mathcal{G}_p}\lambda_g^2 + (L\alpha-1)\,\alpha\sum_{g\in\mathcal{G}_p}\lambda_g\cos(\theta_g)\|[\nabla f(x)]_g\|_2,
\end{aligned}$$
which completes the proof.

Lemma 2. Suppose $\alpha\le\frac{1}{L}$ and $f$ is $L$-smooth. Then for every $g\in\mathcal{G}_p$ there exists some positive $\lambda_g\in(\lambda_{\min,g},\lambda_{\max,g})$ such that
$$f(x+\alpha d(x)) \le f(x) - \Big(\alpha-\tfrac{L\alpha^2}{2}\Big)\big\|[\nabla f(x)]_{\mathcal{G}_{np}}\big\|_2^2. \tag{10}$$
Proof. Based on Lemma 1 and $\alpha\le\frac{1}{L}$, we have
$$f(x+\alpha d(x)) \le f(x) - \Big(\alpha-\tfrac{L\alpha^2}{2}\Big)\big\|[\nabla f(x)]_{\mathcal{G}_{np}}\big\|_2^2 + \sum_{g\in\mathcal{G}_p} h(\lambda_g,g), \tag{11}$$
where we denote
$$h(\lambda_g,g) := \tfrac{L\alpha^2}{2}\lambda_g^2 + (L\alpha-1)\,\alpha\cos(\theta_g)\|[\nabla f(x)]_g\|_2\,\lambda_g - \Big(\alpha-\tfrac{L\alpha^2}{2}\Big)\|[\nabla f(x)]_g\|_2^2. \tag{12}$$
For any $g\in\mathcal{G}_p$, $h(\lambda_g,g)\le 0$ if and only if
$$\lambda_g \le \frac{(1-L\alpha)\alpha\cos(\theta_g)\|[\nabla f(x)]_g\|_2 + \sqrt{(1-L\alpha)^2\alpha^2\cos^2(\theta_g)\|[\nabla f(x)]_g\|_2^2 + 2L\alpha^2\big(\alpha-\tfrac{L\alpha^2}{2}\big)\|[\nabla f(x)]_g\|_2^2}}{L\alpha^2} := \hat\lambda_g. \tag{13}$$
Next, we show that for any group $g$ that requires a $\lambda_g$ adjustment, $(\lambda_{\min,g},\hat\lambda_g)$ is a valid interval, i.e., $\hat\lambda_g\ge\lambda_{\min,g}$. Combining with $\lambda_{\min,g} = -\cos(\theta_g)\|[\nabla f(x)]_g\|_2$, we have
$$\begin{aligned}
\hat\lambda_g &= \frac{(L\alpha-1)\alpha\lambda_{\min,g} + \sqrt{(1-L\alpha)^2\alpha^2\lambda_{\min,g}^2 + 2L\alpha^2\big(\alpha-\tfrac{L\alpha^2}{2}\big)\lambda_{\min,g}^2/\cos^2(\theta_g)}}{L\alpha^2}\\
&= \frac{(L\alpha-1)\lambda_{\min,g} + \lambda_{\min,g}\sqrt{1 - L^2\alpha^2\tan^2(\theta_g) + 2L\alpha\tan^2(\theta_g)}}{L\alpha}\\
&= \frac{(L\alpha-1)\lambda_{\min,g} + \lambda_{\min,g}\sqrt{-(L\alpha-1)^2\tan^2(\theta_g) + \tan^2(\theta_g) + 1}}{L\alpha}\\
&> \frac{(L\alpha-1)\lambda_{\min,g} + \lambda_{\min,g}}{L\alpha} = \lambda_{\min,g},
\end{aligned}$$
where the last inequality holds since $0<\alpha\le 1/L$ implies $-(L\alpha-1)^2\tan^2(\theta_g)+\tan^2(\theta_g)+1 > 1$. We then show that $\hat\lambda_g > 0$ by considering two cases:
• $\cos(\theta_g) < 0$: then $\hat\lambda_g \ge \lambda_{\min,g} = -\cos(\theta_g)\|[\nabla f(x)]_g\|_2 > 0$.
• $\cos(\theta_g) \ge 0$: it follows from $0<\alpha<1/L$ and (13) that $\hat\lambda_g > 0$.
Thus, for $\lambda_g\in(\lambda_{\min,g},\min\{\lambda_{\max,g},\hat\lambda_g\})$ we have $h(\lambda_g,g)\le 0$. Consequently, there exists some positive $\lambda_g\in(\lambda_{\min,g},\lambda_{\max,g})$ so that $h(\lambda_g,g)\le 0$.
Finally, choosing such $\lambda_g$ for every $g\in\mathcal{G}_p$ completes the proof:
$$f(x+\alpha d(x)) \le f(x) - \Big(\alpha-\tfrac{L\alpha^2}{2}\Big)\big\|[\nabla f(x)]_{\mathcal{G}_{np}}\big\|_2^2 + \sum_{g\in\mathcal{G}_p} h(\lambda_g,g) \le f(x) - \Big(\alpha-\tfrac{L\alpha^2}{2}\Big)\big\|[\nabla f(x)]_{\mathcal{G}_{np}}\big\|_2^2. \tag{14}$$

Lemma 3. For any $g\in\mathcal{G}_p$, if $0<\alpha<\frac{2[x]_g^\top[d(x)]_g}{\|[d(x)]_g\|_2^2}$, then the magnitude of the variables satisfies
$$\|[x+\alpha d(x)]_g\|_2 < \|[x]_g\|_2. \tag{16}$$
And if $\alpha = \omega\,\frac{[x]_g^\top[d(x)]_g}{\|[d(x)]_g\|_2^2}$ for $\omega\in(0,2)$, then
$$\|[x+\alpha d(x)]_g\|_2^2 = \|[x]_g\|_2^2 + (\omega^2-2\omega)\,\|[x]_g\|_2^2\cos^2(\theta_g). \tag{17}$$
Proof.
$$\|[x+\alpha d(x)]_g\|_2^2 = \Big\|[x]_g - \alpha\Big([\nabla f(x)]_g + \lambda_g\tfrac{[x]_g}{\|[x]_g\|_2}\Big)\Big\|_2^2 = \|[x]_g\|_2^2 - 2\alpha[x]_g^\top\Big([\nabla f(x)]_g + \lambda_g\tfrac{[x]_g}{\|[x]_g\|_2}\Big) + \alpha^2\Big\|[\nabla f(x)]_g + \lambda_g\tfrac{[x]_g}{\|[x]_g\|_2}\Big\|_2^2 = \|[x]_g\|_2^2 + t(\alpha), \tag{18}$$
where
$$t(\alpha) = A\alpha^2 - 2B\alpha \tag{19}$$
with
$$A := \Big\|[\nabla f(x)]_g + \lambda_g\tfrac{[x]_g}{\|[x]_g\|_2}\Big\|_2^2 > 0, \tag{20}$$
$$B := [x]_g^\top\Big([\nabla f(x)]_g + \lambda_g\tfrac{[x]_g}{\|[x]_g\|_2}\Big) > 0; \tag{21}$$
note that $B>0$ because of the selection of $\lambda_g$. Consequently, if $0<\alpha<\frac{2B}{A} = \frac{2[x]_g^\top[d(x)]_g}{\|[d(x)]_g\|_2^2}$, then $t(\alpha)<0$. Finally, if $\alpha = \omega\frac{[x]_g^\top[d(x)]_g}{\|[d(x)]_g\|_2^2} = \omega\frac{B}{A}$ for $\omega\in(0,2)$, then
$$t(\alpha) = A\,\frac{\omega^2 B^2}{A^2} - 2B\,\frac{\omega B}{A} = (\omega^2-2\omega)\,\frac{B^2}{A} = (\omega^2-2\omega)\,\|[x]_g\|_2^2\cos^2(\theta_g),$$
which completes the proof.

Corollary 1. Suppose $\alpha = \omega\min_{g\in\mathcal{G}_p}\frac{[x]_g^\top[d(x)]_g}{\|[d(x)]_g\|_2^2}$ for some $\omega\in(0,1)$ and $|\cos(\theta_g)|\ge\rho$ for all $g\in\mathcal{G}_p\cap\mathcal{G}_{\neq 0}(x)$ and some positive $\rho\in(0,1]$. Then there exists $\gamma\in(0,1)$ such that
$$\big\|[x+\alpha d(x)]_{\mathcal{G}_p}\big\|_2^2 \le (1-\gamma^2)\,\big\|[x]_{\mathcal{G}_p}\big\|_2^2. \tag{22}$$
Proof. The result follows by summing (17) over $\mathcal{G}_p$ and combining with the selection of $\alpha$.
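As a quick numerical sanity check of the algebra behind Lemma 3 (our own illustration, not part of the paper's code), the identity $\|[x+\alpha d]_g\|^2 = \|[x]_g\|^2 + A\alpha^2 - 2B\alpha$ can be verified directly on a random penalized group:

```python
import numpy as np

def verify_lemma3(seed=0):
    rng = np.random.default_rng(seed)
    x = rng.normal(size=5)        # variables of one penalized group g
    grad = rng.normal(size=5)     # [grad f(x)]_g
    lam = 0.3                     # lambda_g

    # Penalized-group search direction: d = -(grad + lam * x / ||x||).
    d = -(grad + lam * x / np.linalg.norm(x))
    A = np.dot(d, d)                                    # A as in (20)
    B = np.dot(x, grad + lam * x / np.linalg.norm(x))   # B as in (21)

    # Check ||x + alpha*d||^2 = ||x||^2 + A*alpha^2 - 2*B*alpha.
    for alpha in (0.01, 0.1, 0.5):
        lhs = np.linalg.norm(x + alpha * d) ** 2
        rhs = np.linalg.norm(x) ** 2 + A * alpha ** 2 - 2 * B * alpha
        if not np.isclose(lhs, rhs):
            return False
    # When B > 0, any alpha in (0, 2B/A) strictly shrinks the group magnitude.
    if B > 0:
        alpha = B / A
        if not np.linalg.norm(x + alpha * d) < np.linalg.norm(x):
            return False
    return True

print(verify_lemma3())  # True
```

The shrinkage branch is exactly the mechanism DHSPG relies on to drive penalized groups toward zero.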

D ZIG ILLUSTRATION

In this appendix, for a more intuitive illustration, we provide visualizations of the connected components of dependency for the DNNs experimented with throughout the paper. They are constructed by performing Algorithm 2 to automatically partition ZIGs. Due to the scale and intricacy of the graphs, we recommend zooming in for greater detail (200%-1600% zoom via Adobe PDF reader). Vertices marked with the same color represent one connected component of dependency. See the figures starting from the next pages.
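Algorithm 2 operates on the traced operator graph of the DNN; the grouping step itself boils down to finding connected components over a dependency relation, where layers whose outputs are added or jointly normalized must be pruned together. A minimal sketch with illustrative layer names (not the paper's actual graph construction):

```python
from collections import defaultdict

# Toy dependency edges: layers that must remove the same channels share an edge.
edges = [("conv2", "conv3"), ("conv2", "bn2"), ("conv3", "bn3"),
         ("conv1", "bn1")]

def connected_components(edges):
    """Return the connected components of an undirected dependency graph."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    seen, comps = set(), []
    for node in adj:
        if node in seen:
            continue
        stack, comp = [node], set()
        while stack:                       # iterative DFS
            n = stack.pop()
            if n in seen:
                continue
            seen.add(n)
            comp.add(n)
            stack.extend(adj[n] - seen)
        comps.append(comp)
    return comps

print(connected_components(edges))
# Two components: {conv2, conv3, bn2, bn3} and {conv1, bn1}
```

Each component then yields the ZIGs whose variables are pruned jointly, which is what the same-colored vertices in the figures depict.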



Figure 1: OTOv2 versus existing methods.

2: Automated ZIG Partition. Partition the trainable parameters of M into G.
3: Train M by DHSPG. Seek a highly group-sparse solution x*_DHSPG with high performance.
4: Automated Compressed Model M* Construction. Construct a slimmer model upon x*_DHSPG.
5: Output: compressed slimmed model M*.

Figure 2: Automated ZIG partition illustration. K_i and b_i are the flattened filter matrix and bias vector of Convi, where the jth row of K_i represents the jth 3D filter. γ_i and β_i are the weighting and bias vectors of BNi. W_i and b_wi are the weighting matrix and bias vector of Lineari. The ground-truth ZIGs G are presented in Figure 2e. Since the output tensors of Conv2 and Conv3 are added together, both layers, along with the subsequent BN2 and BN3, must remove the same number of filters from K_2 and K_3 and scalars from b_2, b_3, γ_2, γ_3, β_2 and β_3 to keep the addition valid. Since BN4 normalizes the concatenated outputs along the channel dimension from Conv1-BN1-ReLU and Conv3+Conv2-BN2|BN3, the corresponding scalars in γ_4, β_4 need to be removed simultaneously.
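The addition constraint in the caption can be seen with a toy example (shapes and values are illustrative): flattened filter matrices K2 and K3, with one row per filter, must drop the same rows so that the channel-wise addition of their outputs remains well-defined:

```python
import numpy as np

# Toy flattened filter matrices: one row per 3D filter (4 filters each).
K2 = np.arange(12.0).reshape(4, 3)
K3 = np.ones((4, 3))
X = np.ones(3)                 # a flattened input patch

keep = [0, 2]                  # prune filters 1 and 3 from BOTH layers

# Outputs are added channel-wise, so both layers must keep identical rows;
# the matching entries of b2, b3, gamma2/3, beta2/3 are removed alongside.
y = K2[keep] @ X + K3[keep] @ X
print(y.shape)  # (2,)
```

Keeping different row subsets in the two layers would leave tensors whose channels no longer correspond, which is precisely why these parameters land in the same ZIG.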

Figure 3: Local optima x * ∈ R 2 distribution over the objective landscape.

Figure 4: Search direction in DHSPG.

Figure 5: Automated compressed model construction. G = {g_1, g_2, ..., g_5} and [x*_DHSPG]_{g_2∪g_3∪g_4} = 0.

Figure 7: Relative comparison of average runtime per epoch.

Figure 8: CARN.

Figure 9: DemoNet.

Figure 10: DenseNet121.

Figure 11: ResNet50.

Figure 12: StackedUnets.

Figure 13: ConvNeXt-Tiny.

Figure 14: VGG16 and VGG16-BN.

1: Input: initial learning rate α_0, warm-up steps T_w, half-space project steps T_h, target group sparsity K and ZIGs G.
2: Warm up T_w steps via stochastic gradient descent.
3: Construct G_p and G_np given G and K as (2).
4: for t = T_w, T_w + 1, T_w + 2, · · · do
5: Update [x_{t+1}]_{G_np} as [x_t − α_t ∇f(x_t)]_{G_np}.
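The steps above can be sketched as a single simplified iterate. The function below is our own illustration distilled from the text: the variable names, the eps threshold, and the exact projection test are assumptions, not the released implementation:

```python
import numpy as np

def dhspg_step(x, grad, groups_p, groups_np, lam, alpha, do_halfspace, eps=0.0):
    """One simplified DHSPG iterate over index groups of a flat parameter vector."""
    x_new = x.copy()
    # Non-penalized groups G_np: plain (stochastic) gradient descent.
    for g in groups_np:
        x_new[g] = x[g] - alpha * grad[g]
    # Penalized groups G_p: gradient plus a group-lasso-like magnitude penalty.
    for g in groups_p:
        norm = np.linalg.norm(x[g])
        d = grad[g] + (lam * x[g] / norm if norm > 0 else 0.0)
        trial = x[g] - alpha * d
        # Half-space projection (after T_h steps): zero the whole group if the
        # trial point leaves the half-space {z : z^T x >= eps * ||x||^2}.
        if do_halfspace and np.dot(trial, x[g]) < eps * norm ** 2:
            trial = np.zeros_like(trial)
        x_new[g] = trial
    return x_new

x = np.array([1.0, 1.0, 5.0, 5.0])
grad = np.array([10.0, 10.0, 1.0, 1.0])
out = dhspg_step(x, grad, groups_p=[[0, 1]], groups_np=[[2, 3]],
                 lam=0.1, alpha=0.2, do_halfspace=True)
print(out)  # penalized group projected to zero; the other group takes a GD step
```

The two treatments are what the text means by separating magnitude-penalized groups from the rest: only G_p is ever projected onto zero, so the target sparsity K is enforced without dragging every variable toward the origin.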

OTOv2 on extensive DNNs and datasets. We first consider vanilla VGG16 and a variant referred to as VGG16-BN that appends a batch normalization layer after every convolutional layer. OTOv2 automatically exploits the minimal removal structures of VGG16 and partitions the trainable variables into ZIGs (see Figure 14 in Appendix D). DHSPG is then triggered over the partitioned ZIGs to train the model from scratch and find a solution with high group sparsity. Finally, a slimmer VGG16 is automatically constructed without any fine-tuning. As shown in Table 2, the slimmer VGG16 leverages only 2.5% of the parameters to dramatically reduce the FLOPs by 86.6%, with top-1 accuracy competitive with the full model and other state-of-the-art methods. Likewise, OTOv2 compresses VGG16-BN to maintain the baseline accuracy with the fewest resources, 4.9% of the parameters and 23.7% of the FLOPs. Though SCP and RP reach higher accuracy, they require 43%-102% more FLOPs than OTOv2.

Table 3: ResNet50 for CIFAR10.

Table 2: VGG16 and VGG16-BN for CIFAR10. Convolutional layers are in bold.

Sparse optimizers such as (Miao et al., 2021) do not construct slimmer models but merely project variables onto zero. OTOv2 overcomes all these drawbacks and is the first to realize end-to-end autonomy for simultaneously training and compressing arbitrary DNNs with high performance. Furthermore, OTOv2 achieves state-of-the-art results on this ResNet50-on-CIFAR10 experiment. In particular, as shown in Table 3, under a 90% group sparsity level, OTOv2 utilizes only 1.2% of the parameters and 2.2% of the FLOPs to reach 93.0% top-1 accuracy with a slight 0.5% regression. Under 80% group sparsity, OTOv2 achieves a competitive 94.5% accuracy relative to other pruning methods while using substantially fewer parameters and FLOPs.

OTOv2 also exhibits a Pareto frontier in terms of top-1 accuracy and FLOPs reduction under various group sparsities. In particular, under 70% group sparsity, the slimmer ResNet50 by OTOv2 requires fewer FLOPs (14.5%) than others, with a 70.3% top-1 accuracy that is competitive with SFP (He et al., 2018a) and RBP (Zhou et al., 2019), especially under 3x fewer FLOPs. The model with 72.3% top-1 accuracy under 60% group sparsity is competitive with CP (He et al., 2017), DDS-26 (Huang & Wang, 2018) and RRBP (Zhou et al., 2019), but 2-3 times more efficient. The slimmer ResNet50s under 40% and 50% group sparsity reach the accuracy milestone of around 75%, and both of their FLOPs reductions outperform most of the state-of-the-art.

Table 4: OTOv2 under DHSPG versus HSPG on CARNx2.

Table 5: Numerical results of Bert on Squad. Approximate value based on the group sparsity reported in (Deleu & Bengio, 2021).

B.3 RUNTIME COMPARISON

We provide a runtime comparison of the proposed DHSPG versus the optimization algorithms used in the benchmark baseline experiments. In particular, we calculate the average runtime per epoch and report the relative runtime in Figure 7. Based on Figure 7, we observe that the per-epoch cost of DHSPG is competitive with other standard optimizers. In addition, OTOv2 trains the DNN only once, with a similar total number of epochs as training the baselines. Other compression methods instead proceed through multi-stage training procedures, including pretraining the baselines with standard optimizers, and are thus less training-efficient than OTOv2.

4. NUMERICAL EXPERIMENTS

We develop OTOv2 to train and compress DNNs into slimmer networks with significant inference speedup and storage saving without fine-tuning. The implementation details are presented in Appendix A. To demonstrate its effectiveness, we first verify the correctness of automated ZIG partition and automated compact model construction by employing OTOv2 onto a variety of DNNs with

FUNDING

Partially supported by NSF grant CCF-2240708.

Library Usage

from only_train_once import OTO

AVAILABILITY


