MODEL COMPRESSION VIA HYPER-STRUCTURE NETWORK

Abstract

In this paper, we propose a novel channel pruning method for the compression and acceleration of Convolutional Neural Networks (CNNs). Previous channel pruning methods usually ignore the relationships between channels and layers; many of them parameterize each channel independently by using gates or similar concepts. To fill this gap, we propose a hyper-structure network that generates the architecture of the main network. Like existing hypernets, our hyper-structure network can be optimized by regular backpropagation. Moreover, we use a regularization term to specify the computational budget of the compact network. FLOPs is the usual criterion of computational resource; however, if FLOPs is used directly in the regularization, it may over-penalize early layers. To address this issue, we further introduce learnable layer-wise scaling factors to balance the gradients from different terms, and these factors can be optimized by hyper-gradient descent. Extensive experimental results on CIFAR-10 and ImageNet show that our method is competitive with state-of-the-art methods.

1. INTRODUCTION

Convolutional Neural Networks (CNNs) have achieved great success in many machine learning and computer vision tasks (Krizhevsky et al., 2012; Redmon et al., 2016; Ren et al., 2015; Simonyan & Zisserman, 2014a; Bojarski et al., 2016). To deal with real-world applications, the design of CNNs has recently become more and more complicated in terms of width, depth, etc. (Krizhevsky et al., 2012; Simonyan & Zisserman, 2014b; He et al., 2016; Huang et al., 2017). Although these complex CNNs attain better performance on benchmark tasks, their computational and storage costs increase dramatically. As a result, a typical CNN-based application can easily exhaust an embedded or mobile device, and can hardly be deployed on resource-limited platforms. To tackle these problems, many methods (Han et al., 2015b; a) have been devoted to compressing large CNNs into compact models. Among these methods, weight pruning and structural pruning are two popular directions. Unlike weight pruning or sparsification, structural pruning, especially channel pruning, is an effective way to reduce the computational cost of a model because it does not require any post-processing steps to achieve actual acceleration and compression. Many existing works (Liu et al., 2017; Ye et al., 2018; Huang & Wang, 2018; Kim et al., 2020; You et al., 2019) approach structural pruning by applying gates or similar concepts to the channels of a layer. Although these ideas have achieved many successes in channel pruning, there are potential problems. Usually, each gate has its own parameter, but parameters from different gates have no dependence; as a result, they can hardly learn inter-channel or inter-layer relationships. For the same reason, the slimmed models from these methods could overlook the information between different channels and layers, potentially yielding sub-optimal compression results.
To address these challenges, we propose a novel channel pruning method inspired by hypernet (Ha et al., 2016), which uses a hyper network to generate the weights of another network and optimizes the hyper network through backpropagation. We extend the hypernet to a hyper-structure network that generates an architecture vector for a CNN instead of weights. Each architecture vector corresponds to a sub-network of the main (original) network, so inter-channel and inter-layer relationships can be captured by our hyper-structure network. Besides the hyper-structure network, we also introduce a regularization term to control the computational budget of a sub-network. Recent model compression methods focus on pruning computational FLOPs instead of parameters. The problem with FLOPs regularization is that its gradients heavily penalize early layers, which can be regarded as a bias towards later layers; such a bias restricts the potential search space of sub-networks. To let our hyper-structure network explore more possible structures, we further introduce layer-wise scaling factors to balance the gradients from the different losses for each layer. These factors can be optimized by hyper-gradient descent. Our contributions are summarized as follows:

1) Inspired by hypernet, we propose a hyper-structure network for model compression that captures inter-channel and inter-layer relationships. Like a hypernet, the proposed hyper-structure network can be optimized by regular backpropagation.

2) Gradients from FLOPs regularization are biased toward later layers, which truncates the potential search space of sub-networks. To balance the gradients from different terms, layer-wise scaling factors are introduced for each layer; these factors can be optimized through hyper-gradient descent at trivial additional cost.

3) Extensive experiments on CIFAR-10 and ImageNet show that our method outperforms both conventional channel pruning methods and AutoML-based pruning methods on ResNet and MobileNetV2.

2. RELATED WORK

2.1. MODEL COMPRESSION

Recently, model compression has drawn a lot of attention from the community. Among model compression methods, weight pruning and structural pruning are two popular directions. Weight pruning eliminates redundant connections without assumptions on the structure of the weights. Weight pruning methods can achieve very high compression rates, but they need specially designed sparse-matrix libraries to achieve actual acceleration and compression. As one of the early works, Han et al. (2015b) propose to use L1 or L2 magnitude as the criterion to prune weights and connections. SNIP (Lee et al., 2019) updates the importance of each weight using gradients from the loss function, and weights with lower importance are pruned. The lottery ticket hypothesis (Frankle & Carbin, 2019) assumes there exist high-performance sub-networks within the large network at initialization time; the sub-network is then retrained with the same initialization. In rethinking network pruning (Liu et al., 2019b), the authors challenge the typical model compression pipeline (training, pruning, fine-tuning) and argue that fine-tuning is not necessary; instead, they show that training the compressed model from scratch with random initialization can obtain better results. One of the earlier works in structural pruning (Li et al., 2017) uses the sum of the absolute values of kernel weights as the criterion for filter pruning. Instead of directly pruning filters based on magnitude, structural sparsity learning (Wen et al., 2016) prunes redundant structures with Group Lasso regularization. On top of structural sparsity, GrOWL regularization is applied to make similar structures share the same weights (Zhang et al., 2018). One problem with Group Lasso is that weights with small values could still be important, and it is difficult for structures under Group Lasso regularization to reach exactly zero values. As a result, Louizos et al. (2018) propose explicit L0 regularization to make weights within structures exactly zero. Besides using the magnitude of structure weights as a criterion, other methods utilize the scaling factor of batchnorm to achieve structural pruning, since batchnorm (Ioffe & Szegedy, 2015) is widely used in recent neural network designs (He et al., 2016; Huang et al., 2017). A straightforward way to achieve channel pruning is to make the scaling factors of batchnorm sparse (Liu et al., 2017): if the scaling factor of a channel falls below a certain threshold, the channel is removed. The scaling factor can also be regarded as the gate parameter of a channel; methods related to this concept include (Ye et al., 2018; Huang & Wang, 2018; Kim et al., 2020; You et al., 2019). Though gates have achieved many successes in channel pruning, they cannot capture the relationships between channels and across layers. Beyond gates, collaborative channel pruning (Peng et al., 2019) prunes channels using a Taylor expansion. Our method is also related to Automatic Model Compression (AMC) (He et al., 2018b). AMC uses policy gradient to update a policy network, which potentially provides both inter-channel and inter-layer information; however, the high variance of policy gradient makes it less efficient and effective compared to our method. In this paper, we focus on channel pruning, since it provides a natural way to reduce computation and memory costs. Besides weight and channel pruning, there are works from other perspectives, including Bayesian pruning (Molchanov et al., 2017; Neklyudov et al., 2017), weight quantization (Courbariaux et al., 2015; Rastegari et al., 2016), and knowledge distillation (Hinton et al., 2015).

2.2. HYPERNET

Hypernet (Ha et al., 2016) was introduced to generate the weights of a network using a hyper network. Hyper networks have since been applied to many machine learning tasks. von Oswald et al. (2020) use a hyper network to generate weights based on task identity to combat catastrophic forgetting in continual learning. MetaPruning (Liu et al., 2019a) utilizes a hyper network to generate weights when performing an evolutionary algorithm. SMASH (Brock et al., 2018) is a neural architecture search method that predicts the weights of a network given its architecture, and GPN (Zhang et al., 2019) extends this idea to any directed acyclic graph. Other applications include Bayesian neural networks (Krueger et al., 2017), multi-task learning (Pan et al., 2018), generative models (Suarez, 2017), and so on. Different from the original hypernet, the proposed hyper-structure network aims to generate the architecture of a sub-network.

3.1. NOTATIONS

To better describe our proposed approach, we first introduce the necessary notation. In a CNN, the feature map of the ith layer is F_i ∈ R^{C_i×W_i×H_i}, i = 1, …, L, where C_i is the number of channels, W_i and H_i are the width and height of the feature map, and L is the number of layers. The mini-batch dimension of feature maps is omitted to simplify notation. sigmoid(·) is the sigmoid function, and round(·) rounds its input to the nearest integer.

3.2. HYPER-STRUCTURE NETWORK

In the context of channel pruning, we need to decide whether each channel should be pruned or kept, which can be encoded as 0 (removed) or 1 (kept). Consequently, the architecture of a sub-network can be represented as a concatenated 0-1 vector over all layers. Our goal is then to use a neural network to generate this vector. For the ith layer, the following output vector is generated: o_i = HSN(a_i; Θ), where HSN is our proposed hyper-structure network composed of a gated recurrent unit (GRU) (Cho et al., 2014) and dense layers, a_i is a fixed random vector drawn from a uniform distribution U(0, 1), and Θ is the parameter of HSN. The detailed setup of HSN can be found in Appendix C. In short, the GRU captures sequential relationships between layers, and the dense layers capture inter-channel relationships. Note that a_i is a constant vector during training; if a_i were randomly re-sampled, learning would become more difficult and result in sub-optimal performance. Given the output o_i, we need to convert it to a 0-1 vector to evaluate the sub-network. The binarization process is: z_i = sigmoid((o_i + g)/τ), v_i = round(z_i), v_i ∈ {0, 1}^{C_i}, where g follows the Gumbel distribution g ∼ Gumbel(0, 1), v_i is the architecture vector of the ith layer, and τ is a temperature hyper-parameter. Since the round operation is not differentiable, we use the straight-through estimator (STE) (Bengio et al., 2013) to enable gradient calculation: ∂J/∂z_i = ∂J/∂v_i. This process can be summarized as using ST Gumbel-Softmax (Jang et al., 2016) with a fixed temperature to approximate a Bernoulli distribution. The idea of HSN can also be viewed as a mapping from the constant vectors {a_i}_{i=1}^{L} to the architecture of a sub-network.
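For illustration, the binarization step described above (sigmoid with Gumbel noise, then rounding) can be sketched in plain NumPy. This is a minimal sketch of the sampling only: the hyper-structure network and the straight-through gradient are omitted, and the layer size and output values are made up for the example.

```python
import numpy as np

def binarize(o, tau=0.4, rng=None):
    """Binarize HSN outputs o into a 0-1 architecture vector.

    z = sigmoid((o + g) / tau), v = round(z), with g ~ Gumbel(0, 1)
    sampled via inverse transform: g = -log(-log(u)), u ~ U(0, 1).
    """
    rng = rng or np.random.default_rng(0)
    u = rng.uniform(1e-8, 1.0 - 1e-8, size=o.shape)
    g = -np.log(-np.log(u))                    # Gumbel(0, 1) noise
    z = 1.0 / (1.0 + np.exp(-(o + g) / tau))   # relaxed keep-probabilities
    v = np.round(z)                            # hard 0/1 keep-or-prune decisions
    return z, v

# Example: HSN outputs for a layer with 8 channels (values are arbitrary).
o = np.array([-3.0, -1.0, -0.5, 0.0, 0.5, 1.0, 2.0, 3.0])
z, v = binarize(o)
```

In the actual method the rounding is paired with the straight-through estimator so gradients flow through z_i; in PyTorch this is commonly written as `v = (z.round() - z).detach() + z`.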
When we evaluate a sub-network, the feature map of the ith layer is modified as F̃_i = ṽ_i ⊙ F_i, where ⊙ is element-wise multiplication and ṽ_i is v_i expanded (broadcast) to the same size as F_i. The feature map F_i is the output of a Conv-BN-ReLU block. The overall loss function is:

min_Θ J(Θ) := L(f(x; W, v), y) + λ R(T(v), p T_total),

where v = (v_1, …, v_L), T(v) is the FLOPs of the architecture decided by v, T_total is the total FLOPs of the original model, p ∈ (0, 1] is a predefined parameter deciding the remaining fraction of FLOPs, λ is a hyper-parameter controlling the strength of the FLOPs regularization, f(x; W, v) is the CNN parameterized by W with sub-network structure determined by the architecture vector v, L is the cross-entropy loss, R is the regularization term for FLOPs, and Θ again denotes the parameters of HSN. The regularization term used in this paper is R(T(v), p T_total) = log(|T(v) − p T_total| + 1).
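To make the budget term concrete, the sketch below evaluates R(T(v), p·T_total) = log(|T(v) − p·T_total| + 1) for a toy masked network. The per-layer FLOPs formula follows the form detailed in Appendix B; the two-layer shapes and channel counts are invented for the example.

```python
import numpy as np

def layer_flops(v_prev, v_cur, k, w, h, groups=1):
    """FLOPs of one conv layer under masks: K^2 * (kept_in / groups) * kept_out * W * H."""
    return (k * k) * (v_prev.sum() / groups) * v_cur.sum() * w * h

def flops_reg(T_v, p, T_total):
    """Budget regularizer R = log(|T(v) - p * T_total| + 1); zero exactly at budget."""
    return np.log(np.abs(T_v - p * T_total) + 1.0)

# Toy two-layer network: 3x3 convs on 8x8 feature maps, 16 -> 16 -> 16 channels.
shapes = [(3, 8, 8), (3, 8, 8)]
full = [np.ones(16), np.ones(16), np.ones(16)]          # input mask + two layer masks
T_total = sum(layer_flops(full[i], full[i + 1], *shapes[i]) for i in range(2))

# Keep half the channels in each layer.
half = [np.ones(16), np.r_[np.ones(8), np.zeros(8)], np.r_[np.ones(8), np.zeros(8)]]
T_v = sum(layer_flops(half[i], half[i + 1], *shapes[i]) for i in range(2))
```

Note that R vanishes when T(v) hits the budget p·T_total exactly and grows only logarithmically away from it, which keeps its gradient magnitude moderate compared to an MSE penalty.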

3.3. LAYER-WISE SCALING

The FLOPs regularization considered in Eq. 4 will heavily penalize layers with a large amount of FLOPs (early layers for most architectures). Consequently, the resulting architecture will have a larger pruning rate at early layers, and alternative architectures with similar FLOPs may be omitted. This phenomenon is demonstrated in Fig. 2, and further analysis is provided in Appendix B. To alleviate this problem, we introduce layer-wise scaling factors to dynamically balance the gradients from the regularization term R and the loss term L. Only the gradients in the dense layers are balanced, since the GRU is shared by all layers. The gradient w.r.t. the parameters of the ith dense layer becomes:

∂J/∂θ_i = α_i ∂L/∂θ_i + λ ∂R/∂θ_i,

where θ_i is the parameter of the ith dense layer and α_i is the layer-wise scaling factor for the ith layer (if no layer-wise scaling is applied, α_i = 1). α_i can be regarded as a balancing factor between ∂R/∂θ_i and ∂L/∂θ_i. Since α_i only appears in the gradient calculation, it cannot be directly optimized. To optimize α_i, we follow a similar derivation to (Baydin et al., 2018). We first define the update rule θ_i^t = u(θ_i^{t−1}, α_i^t), which can be applied to any optimization algorithm. For example, under stochastic gradient descent,

u(θ_i^{t−1}, α_i^t) = θ_i^{t−1} − η(α_i^t ∂L/∂θ_i^{t−1} + λ ∂R/∂θ_i^{t−1}).

Ideally, we want to update α_i so that the corresponding architecture obtains a lower value of the loss J. To do so, we minimize J(u(θ_i^{t−1}, α_i^t)) over α_i before updating θ_i (for simplicity, the expectation is omitted). The hyper-gradient with respect to α_i is:

∂J(u(θ_i^{t−1}, α_i^t))/∂α_i = (∂J/∂u)^T ∂u(θ_i^{t−1}, α_i^t)/∂α_i = (∂J/∂θ_i^t)^T ∂u(θ_i^{t−1}, α_i^t)/∂α_i,

and α_i is then updated with this hyper-gradient:

α_i^t = α_i^{t−1} − β ∂J(u(θ_i^{t−2}, α_i^{t−1}))/∂α_i, i = 1, …, L.
Given the hyper-gradient of α_i, it can be updated by a regular gradient descent method. In the experiments, the update rule is the ADAM optimizer (Kingma & Ba, 2014); the detailed derivation of ∂J(u(θ_i^{t−1}, α_i^t))/∂α_i for the ADAM optimizer is described in Appendix E.
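For intuition, the SGD form of the hyper-gradient can be checked numerically on a toy objective. In this sketch the loss L, regularizer R, and all constants are fabricated for the example; the chain rule gives ∂u/∂α = −η ∂L/∂θ, so the analytic hyper-gradient is (∂J/∂θ^t)^T (−η ∂L/∂θ^{t−1}), which we compare against a finite difference.

```python
import numpy as np

# Toy objective J(theta) = L(theta) + lam * R(theta) with simple quadratics.
lam, eta = 0.5, 0.1
L  = lambda th: 0.5 * np.sum((th - 1.0) ** 2)
R  = lambda th: 0.5 * np.sum(th ** 2)
gL = lambda th: th - 1.0            # dL/dtheta
gR = lambda th: th                  # dR/dtheta
J  = lambda th: L(th) + lam * R(th)

def u(theta, alpha):
    """One SGD step with layer-wise scaling alpha on the loss gradient."""
    return theta - eta * (alpha * gL(theta) + lam * gR(theta))

theta0, alpha = np.array([2.0, -1.0]), 1.3

# Analytic hyper-gradient: (dJ/dtheta_t)^T du/dalpha, with du/dalpha = -eta * gL(theta0).
theta1 = u(theta0, alpha)
hg = (gL(theta1) + lam * gR(theta1)) @ (-eta * gL(theta0))

# Central finite-difference check of dJ(u(theta0, alpha))/dalpha.
eps = 1e-6
fd = (J(u(theta0, alpha + eps)) - J(u(theta0, alpha - eps))) / (2 * eps)
```

Since the toy objective is quadratic, the central difference should agree with the analytic hyper-gradient up to floating-point error.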

3.4. MODEL COMPRESSION VIA HYPER-STRUCTURE NETWORK

In Fig. 6, we provide the flowchart of HSN. The overall algorithm of model compression via hyper-structure network is shown in Alg. 1. As shown there, our method can prune any pre-trained CNN without modifications. It should be emphasized again that the gradient of the GRU is not affected by α_i; it is simply ∂L/∂θ_GRU + λ ∂R/∂θ_GRU, where θ_GRU is the parameter of the GRU. Moreover, HSN does not need the whole dataset for training; a small fraction is enough, which makes the training of HSN quite efficient. After training HSN, we use it to generate an architecture vector v and prune the model according to this vector. Also, note that there is some randomness when generating v (Eq. 2 approximates a Bernoulli distribution), but we find no need to generate the vector multiple times and average the results or conduct a majority vote: when the vector is generated multiple times, most entries are the same, and the differing entries are trivial and do not impact the final performance.

Table 1: Comparison results on CIFAR-10 with ResNet-56.

Method | Architecture | Baseline Acc | Pruned Acc | ∆-Acc | ↓FLOPs
Channel Pruning (He et al., 2017) | ResNet-56 | 92.80% | 91.80% | -1.00% | 50.0%
AMC (He et al., 2018b) | ResNet-56 | 92.80% | 91.90% | -0.90% | 50.0%
Pruning Filters (Li et al., 2017) | ResNet-56 | 93.04% | 93.06% | +0.02% | 27.6%
Soft Pruning (He et al., 2018a) | ResNet-56 | 93.59% | 93.35% | -0.24% | 52.6%
DCP (Zhuang et al., 2018) | ResNet-56 | 93.80% | 93.59% | -0.31% | 50.0%
DCP-Adapt (Zhuang et al., 2018) | ResNet-56 | 93.80% | 93.81% | +0.01% | 47.0%
CCP (Peng et al., 2019) | ResNet-56 | 93.50% | | |

4. EXPERIMENTAL RESULTS

4.1. IMPLEMENTATION DETAILS

Similar to many model compression works, CIFAR-10 (Krizhevsky & Hinton, 2009) and ImageNet (Deng et al., 2009) are used to evaluate the performance of our method. Our method requires one hyper-parameter p to control the FLOPs budget; the detailed choices of p are listed in Appendix F. For CIFAR-10, we compare with other methods on ResNet-56 and MobileNetV2. For ImageNet, we select ResNet-34, ResNet-50, ResNet-101 and MobileNetV2 as our target models. We choose these models because ResNet (He et al., 2016) and MobileNetV2 (Sandler et al., 2018) are much harder to prune than earlier models like AlexNet (Krizhevsky et al., 2012) and VGG (Simonyan & Zisserman, 2014b). λ decides the regularization strength in our method; we choose λ = 4 in all CIFAR-10 experiments and λ = 8 in all ImageNet experiments. For CIFAR-10 models, we train ResNet-56 from scratch following the PyTorch examples. After pruning, we fine-tune the model for 160 epochs using SGD with an initial learning rate of 0.1, weight decay 0.0001 and momentum 0.8; the learning rate is multiplied by 0.1 at epochs 80 and 120. For ImageNet models, we directly use the pre-trained models released with PyTorch (Paszke et al., 2017; 2019). After pruning, we fine-tune the model for 100 epochs using SGD with an initial learning rate of 0.01, weight decay 0.0001 and momentum 0.9; the learning rate is scaled by 0.1 at epochs 30, 60 and 90. For MobileNetV2 on ImageNet, we choose a weight decay of 0.00004, the same as in the original paper (Sandler et al., 2018). For the training of HSN, we use the ADAM (Kingma & Ba, 2014) optimizer with a constant learning rate of 0.001 and train HSN for 200 epochs. τ in Eq. 2 is set to 0.4. The β for α_i is chosen as 0.01, and α_i is updated as shown in Alg. 1. To build the dataset D_HSN, we randomly sample 2,500 and 10,000 samples for CIFAR-10 and ImageNet, respectively.
In the experiments, we found that a stand-alone validation set is not necessary; all samples in D_HSN come from the original training set. All code in this paper is implemented with PyTorch (Paszke et al., 2017; 2019). The experiments are conducted on a machine with 4 Nvidia Tesla P40 GPUs.

4.2. CIFAR-10 RESULTS

In Tab. 1, we present the comparison results on the CIFAR-10 dataset. Our method is abbreviated as MCH (Model Compression via Hyper-Structure Network) in the experiment section. For ResNet-56, our method prunes 50% of FLOPs while obtaining a 0.24% gain in accuracy. On MobileNetV2, our method obtains a 0.38% gain in accuracy. Compared to all other methods, our method achieves the best results: it outperforms the second-best method (DCP-Adapt) by 0.23% on ResNet-56, and on MobileNetV2 it outperforms the second-best method by 0.16% while pruning 14% more FLOPs. For both models, our method performs much better than earlier methods (He et al., 2017; 2018b; Li et al., 2017; He et al., 2018a). Our method also outperforms CCP by 0.32% in terms of ∆-Acc, which demonstrates that learning both inter-layer and inter-channel relationships is better than considering inter-channel relationships alone.

Table 2: Comparison results on ImageNet.

Method | Architecture | Pruned Top-1 | Pruned Top-5 | ∆ Top-1 | ∆ Top-5 | ↓FLOPs
Pruning Filters (Li et al., 2017) | ResNet-34 | 72.17% | - | -1.06% | - | 24.8%
Soft Pruning (He et al., 2018a) | ResNet-34 | 71.84% | 89.70% | -2.09% | -1.92% | 41.1%
IE (Molchanov et al., 2019) | ResNet-34 | 72.83% | - | -0.48% | - | 24.2%
FPGM (He et al., 2019) | ResNet-34 | 72.63% | 91.08% | -1.29% | -0.54% | 41.1%
MCH (ours) | ResNet-34 | 72.85% | 91.15% | -0.45% | -0.27% | 44.0%
IE (Molchanov et al., 2019) | ResNet-50 | 74.50% | - | -1.68% | - | 45.0%
FPGM (He et al., 2019) | ResNet-50 | 74.83% | 92.32% | -1.32% | -0.55% | 53.5%
GAL (Lin et al., 2019) | ResNet-50 | 71.80% | 90.82% | -4.35% | -2.05% | 55.0%
DCP (Zhuang et al., 2018) | ResNet-50 | 74.95% | 92.32% | -1.06% | -0.61% | 55.6%
CCP (Peng et al., 2019) | ResNet-50 | 75.21% | 92.42% | -0.94% | -0.45% | 54.1%
MetaPruning (Liu et al., 2019a) | ResNet-50 | 75.40% | - | -1.20% | - | 51.2%
GBN (You et al., 2019) | ResNet-50 | 75.18% | 92.41% | -0.67% | -0.26% | 55.1%
HRank (Lin et al., 2020) | ResNet-50 | 74.98% | 92.33% | -1.17% | -0.54% | 43.8%
Hinge (Li et al., 2020) | ResNet-50 | 74.70% | - | -1.40% | - | 54.4%
LeGR (Chin et al., 2020) | ResNet-50 | 75 | | |

4.3. IMAGENET RESULTS

In Tab. 2, the results on ImageNet are presented; Top-1/Top-5 accuracies after pruning are reported. Most of the comparison methods come from recently published papers, including IE (Molchanov et al., 2019), FPGM (He et al., 2019), GAL (Lin et al., 2019), CCP (Peng et al., 2019), MetaPruning (Liu et al., 2019a), GBN (You et al., 2019), Hinge (Li et al., 2020), and HRank (Lin et al., 2020). ResNet-34. Our proposed MCH prunes 44.0% of FLOPs with only 0.45% and 0.27% performance loss on Top-1 and Top-5 accuracy, which is better than all other methods. MCH performs similarly to IE (Molchanov et al., 2019) in Top-1 and ∆ Top-1 accuracy (72.85%/-0.45% vs. 72.83%/-0.48%), while our method prunes almost 20% more FLOPs. Given a similar FLOPs pruning rate, our method achieves better results than FPGM (He et al., 2019) (-0.45%/-0.27% vs. -1.29%/-0.54% for ∆ Top-1/∆ Top-5). Beyond IE and FPGM, the margins between our method and the remaining methods are even larger. ResNet-50. ResNet-50 is a very popular model for evaluating model compression methods. Even with such intense competition, our method still achieves the best Top-1/Top-5 and ∆ Top-1/∆ Top-5 results. The second-best method in terms of Top-1 accuracy is MetaPruning (Liu et al., 2019a), which achieves 75.40% Top-1 accuracy after pruning. Our method outperforms MetaPruning by 0.20% in Top-1 accuracy while pruning 5% more FLOPs. MetaPruning utilizes a hypernet to generate weights when evaluating sub-networks; however, such a design prohibits MetaPruning from being directly used on pre-trained models. The weights inherited from the pre-trained model might be one of the reasons why our method outperforms MetaPruning. GBN (You et al., 2019) obtains the second-best ∆ Top-1 accuracy; however, its accuracy after pruning is quite low compared to other methods, and our method outperforms GBN by 0.42% in Top-1 accuracy.
Besides GBN and MetaPruning, our method outperforms two recent methods, HRank (Lin et al., 2020) and Hinge (Li et al., 2020), by 0.62% to 0.90% in Top-1 accuracy. ResNet-101. For ResNet-101, our method increases the performance of the baseline model by 0.21% and 0.25% on Top-1 and Top-5 accuracy while removing 56% of FLOPs. The second-best method, FPGM (He et al., 2019), maintains the baseline performance while reducing 41% of FLOPs; in short, compared to FPGM, our method obtains a performance gain while pruning 15% more FLOPs. MobileNetV2. On MobileNetV2, we mainly compare with AMC (He et al., 2018b) and MetaPruning (Liu et al., 2019a). Both can be regarded as representative AutoML-based model compression methods (AMC uses reinforcement learning; MetaPruning uses an evolutionary algorithm and a hypernet). Our method achieves 71.54% Top-1 accuracy while pruning around 30% of FLOPs, which is 0.34% and 0.74% higher than MetaPruning and AMC, respectively. These results show that our method can outperform AutoML-based methods. In summary, our method outperforms the comparison methods and achieves state-of-the-art performance. These experimental results also indicate that inter-channel and inter-layer relationships should be considered when designing model compression methods.

4.4. EFFECTS OF LAYER-WISE SCALING

We further study the impact of λ and layer-wise scaling (LWS) when training HSN on CIFAR-10. In Fig. 3 (a, b), we can see that changing λ does not have a large impact on the final performance of a sub-network; our method is not sensitive to it. One possible reason is that α_i adapts to λ when using the ADAM optimizer, so in general we do not spend much time tuning λ. Fig. 3 (c, d) shows that using LWS improves the final performance of a sub-network and obtains a lower loss. Moreover, early layers usually have a larger preserved rate with LWS, as shown in Fig. 4, indicating that alternative sub-network architectures can be discovered with LWS. Without LWS, the final performance of ResNet-56 decreases by 0.19%, to 93.04% final accuracy on CIFAR-10. A similar observation holds for MobileNetV2 (94.45% final accuracy, with a relative gap of 0.16%). These observations show that LWS indeed helps the training of HSN.

4.5. DETAILED ANALYSIS

In this section, we provide a detailed analysis to answer the following questions: (1) Why do we use fixed inputs a_i? (2) Can we replace HSN with dense layers? (3) Does LWS work for different learning-rate settings? (4) Does LWS still work with other optimization methods? To answer the first question, we examine three different settings: learnable inputs, fixed inputs, and inputs randomly re-sampled from the uniform distribution. From Fig. 5 (a, b), we observe that fixed inputs have similar performance to learned inputs, and both outperform random inputs. The idea behind fixed inputs is to project the optimal sub-network onto fixed vectors in the input space, which is generally simple (compared to learned inputs) and easy to train (compared to random inputs). These results justify our use of fixed inputs. To verify the effectiveness of the different components of HSN, we use three settings: vanilla HSN, HSN with only dense layers, and gates (the definition is given in the Appendix). From Fig. 5 (c, d), HSN has the best performance, which again shows that we should not treat each channel or each layer separately. In Fig. 5 (e, f), we plot training curves for different learning rates with and without LWS; LWS leads to better performance across learning rates. Finally, in Fig. 5 (g, h), we examine whether LWS is still useful with two additional optimizers: SGD and LARS (You et al., 2017). LARS applies layer-wise learning rates to the overall gradients, which can be complementary to LWS. Applying LWS to these two optimizers still improves performance. SGD itself is not a good choice when the optimization involves discrete values, as suggested by a previous study (Alizadeh et al., 2019).

A VISUALIZATION OF PRUNED ARCHITECTURES

In Fig. 6, we visualize the pruned architectures for ResNet-50 and MobileNetV2.

B BIAS OF FLOPS REGULARIZATION

We briefly discuss the two types of FLOPs regularization used in our paper and in trainable gate (TG) (Kim et al., 2020). First, we give the definition of T(v_i), the FLOPs of the ith layer:

T(v_i) = K_i^2 (1^T v_{i−1} / G_i) (1^T v_i) W_i H_i,

where G_i is the number of groups in a convolution layer, K_i is the kernel size, 1 is an all-one vector, and 1^T v_i is the number of preserved channels in the ith layer. With T(v_i), T(v) = Σ_{i=1}^{L} T(v_i). TG simply uses the mean square error (MSE) as the regularization term: R_MSE(T(v), p T_total) = (T(v) − p T_total)^2. The gradient w.r.t. v_i is:

∂R_MSE/∂v_i = 2 (T(v) − p T_total) ∂T(v_i)/∂v_i.

For the regularization used in our method, R(T(v), p T_total) = log(|T(v) − p T_total| + 1), the gradient w.r.t. v_i is:

∂R/∂v_i = (1 / (|T(v) − p T_total| + 1)) · ((T(v) − p T_total) / |T(v) − p T_total|) · ∂T(v_i)/∂v_i.

For both regularization functions, the ratio between the gradients w.r.t. v_k and v_j of two layers k, j is

(∂T(v_k)/∂v_k) / (∂T(v_j)/∂v_j) = [K_k^2 (1^T v_{k−1} / G_k) W_k H_k] / [K_j^2 (1^T v_{j−1} / G_j) W_j H_j].

Take ResNet-50 as an example: let j, k be the middle layers of bottleneck blocks, and randomly initialize HSN. If j is in the first block and k is in the last block, then K_k = K_j = 3, W_j = H_j = 56, W_k = H_k = 7, 1^T v_{j−1} ≈ 0.5 × 64 (due to random initialization), and 1^T v_{k−1} ≈ 0.5 × 512; finally,

(∂T(v_k)/∂v_k) / (∂T(v_j)/∂v_j) ≈ (3 × 3 × 256 × 7 × 7) / (3 × 3 × 32 × 56 × 56) ≈ 1/8,

which is non-trivial. When calculating the gradients w.r.t. θ_i, we have ∂R/∂θ_i = c_R (∂T(v_i)/∂v_i)(∂v_i/∂θ_i), where all θ_i share the same scalar c_R decided by the regularization function. Without loss of generality, we assume the magnitude of ∂v_i/∂θ_i is similar across layers. The assumption is based on the following derivation (to simplify it, we omit weight norm in the dense layers):

∂v_i/∂θ_i = ∂z_i/∂θ_i = (∂z_i/∂o_i)(∂o_i/∂θ_i) = (1/τ) sigmoid((o_i + g)/τ)(1 − sigmoid((o_i + g)/τ)) ∂o_i/∂θ_i ≤ (1/(4τ)) b_i^T,
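The gradient-ratio example above can be checked with a few lines of arithmetic. The channel counts assume the stated random initialization that keeps roughly half of the 64 and 512 input channels of the two bottleneck blocks.

```python
# Gradient ratio between the middle 3x3 conv of the last and first
# bottleneck blocks of ResNet-50, with ~50% of input channels kept.
k = 3                                       # kernel size in both layers
grad_late  = k * k * (512 // 2) * 7 * 7     # K^2 * kept_in * W * H (late layer)
grad_early = k * k * (64 // 2) * 56 * 56    # same quantity for the early layer
ratio = grad_late / grad_early              # gradient magnitude ratio late/early
```

The ratio comes out to 1/8, i.e., the early layer receives roughly 8x larger gradients from the FLOPs regularization than the late layer.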
where sigmoid(x)(1 − sigmoid(x)) ≤ 1/4, and b_i is the input to the ith dense layer, which is also the output of the GRU. Since all b_i have the same shape and the weights in the GRU are normalized, we can assume all b_i have similar magnitude. Since (1/(4τ)) b_i^T is an upper bound of ∂v_i/∂θ_i, a similar assumption can be made. Following this assumption, the relative magnitude of the gradients w.r.t. θ_j and θ_k for layers j, k can be roughly represented by (∂T(v_k)/∂v_k) / (∂T(v_j)/∂v_j). After training for a while the ratio might become smaller, but this only indicates that early layers are being pruned more aggressively. Thus, the FLOPs regularization penalizes early layers much more heavily than later layers. One should also note that this is a general problem of gradient-based model compression methods with FLOPs regularization: it is quite hard to circumvent calculating ∂T(v_i)/∂v_i, as in TG (Kim et al., 2020) and our paper.

C DETAILED SETUP OF HSN

The forward pass of HSN is: b_i, h_i = GRU(a_i, h_{i−1}), o_i = dense_i(b_i), where h_i and b_i are the hidden state and output of the GRU at step i, and o_i is the final output of HSN. The GRU also requires a hidden-state input h_0 at time-step 0; in the experiments, h_0 is an all-zero tensor. As mentioned in Tab. 3, the dimension of a_i is 64. Since a_i is a single input instead of a mini-batch, we cannot apply batchnorm; to make training more stable, we use weight norm (Salimans & Kingma, 2016) on both the GRU and the dense layers. Initially, we tried to use a single huge dense layer (input size 64, output size C_1 + C_2 + … + C_L) as HSN. However, we found that the huge dense layer is hard to optimize and also parameter heavy. To verify the strength of the proposed HSN, we can instead use a simplified setting to prune neural networks:

ẑ_i = sigmoid((θ̂_i + g)/τ), v̂_i = round(ẑ_i), v̂_i ∈ {0, 1}^{C_i}, (11)

where the architecture vector is parameterized directly by θ̂_i. Under this setting, the parameters of different channels have no relationships.
We use this simplified setting to prune ResNet-56 and MobileNetV2 on CIFAR-10; the results are shown in Fig. 7. From the figure, we can see that both the performance and the convergence speed of HSN are much better. In a high-dimensional setting like MobileNetV2, the simplified setting of Eq. 11 cannot learn efficiently, which demonstrates that capturing inter-channel and inter-layer relationships is crucial for pruning deep neural networks.

Table 3: The architecture of HSN.

Inputs a_i, i = 1, …, L
GRU(64, 128), WeightNorm, ReLU
dense_i(128, C_i), WeightNorm, i = 1, …, L
Outputs o_i, i = 1, …, L

D FORWARD AND BACKWARD PRUNING

Here, we refer to forward pruning as pruning that starts from a random sub-network, and to backward pruning as pruning that starts from the original large model. Many model compression methods use backward pruning, and we provide a simple way to extend our method to it. When we binarize the output of HSN, we can add a constant c:

z_i = sigmoid((o_i + g + c)/τ),    v_i = round(z_i),    v_i ∈ {0, 1}^{C_i},

where g ∼ Gumbel(0, 1), and the Gumbel(0, 1) distribution can be sampled via inverse transform sampling by drawing u ∼ U(0, 1) and computing g = -log(-log(u)). When the constant c is big enough, v_i becomes an all-one vector, so the sub-network produced by HSN starts from the original large CNN. If we set c to 0, it starts from a random sub-network. In Fig. 8, we show the results of forward and backward pruning. They achieve similar sub-network performance, but their regularization losses change in dramatically different ways.
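The effect of the offset c can be sketched as follows. This is a hedged NumPy illustration: the all-zero o_i stands in for an untrained HSN, and τ = 0.4 and c = 10 are our assumptions, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

def binarize(o, c, tau, rng):
    """v_i = round(sigmoid((o_i + g + c) / tau)), with g ~ Gumbel(0, 1)."""
    u = rng.uniform(1e-10, 1.0 - 1e-10, size=o.shape)
    g = -np.log(-np.log(u))  # inverse-transform sampling of Gumbel(0, 1)
    return np.round(sigmoid((o + g + c) / tau))

o = np.zeros(64)  # illustrative untrained HSN outputs for one layer
v_fwd = binarize(o, c=0.0, tau=0.4, rng=rng)   # forward: random sub-network
v_bwd = binarize(o, c=10.0, tau=0.4, rng=rng)  # backward: keeps every channel
```

With c = 0 each gate flips on only when the Gumbel noise is positive, giving a random sub-network; with a large c the argument of the sigmoid is pushed far above zero for any plausible noise draw, so every gate rounds to one and pruning starts from the full model.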

E DERIVATIVE OF HYPER-GRADIENT WITH ADAM OPTIMIZER

The update rule of ADAM for θ_i is shown in Alg. 2, and it is:

u(θ_i^{t-1}, α_i^t) = θ_i^{t-1} - η m̂_t / (√n̂_t + ε).

Algorithm 2: ADAM optimizer for θ_i
Input: η: learning rate; β_1, β_2 ∈ [0, 1): decay rates for ADAM.
Initialize m_0, n_0, t = 0.
Update rule at step t:
  m_t = β_1 m_{t-1} + (1 - β_1)(α_i^t ∂L/∂θ_i^{t-1} + λ ∂R/∂θ_i^{t-1})
  n_t = β_2 n_{t-1} + (1 - β_2)(α_i^t ∂L/∂θ_i^{t-1} + λ ∂R/∂θ_i^{t-1})^2
  m̂_t = m_t / (1 - β_1^t)
  n̂_t = n_t / (1 - β_2^t)
  θ_i^t = u(θ_i^{t-1}, α_i^t) = θ_i^{t-1} - η m̂_t / (√n̂_t + ε)

Then the derivative ∂u(θ_i^{t-1}, α_i^t)/∂α_i is:

∂u(θ_i^{t-1}, α_i^t)/∂α_i
  = -η ∂[m̂_t / (√n̂_t + ε)]/∂α_i
  = -η [ (∂m̂_t/∂α_i)(√n̂_t + ε) - m̂_t ∂(√n̂_t + ε)/∂α_i ] / (√n̂_t + ε)^2
  = -η [ (∂m̂_t/∂α_i) / (√n̂_t + ε) - (∂n̂_t/∂α_i) m̂_t / (2 √n̂_t (√n̂_t + ε)^2) ]
  = -η [ (1 - β_1)(∂L/∂θ_i^{t-1}) / ((1 - β_1^t)(√n̂_t + ε))
         - (1 - β_2)(α_i^t (∂L/∂θ_i^{t-1})^2 + λ (∂L/∂θ_i^{t-1})(∂R/∂θ_i^{t-1})) m̂_t / ((1 - β_2^t) √n̂_t (√n̂_t + ε)^2) ],

where ∂m̂_t/∂α_i = (1 - β_1)(∂L/∂θ_i^{t-1}) / (1 - β_1^t) and ∂n̂_t/∂α_i = 2(1 - β_2)(α_i^t (∂L/∂θ_i^{t-1})^2 + λ (∂L/∂θ_i^{t-1})(∂R/∂θ_i^{t-1})) / (1 - β_2^t).
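As a sanity check, the closed-form derivative above can be compared against a finite-difference approximation of the ADAM update. The scalar NumPy sketch below uses illustrative constants of our own choosing, not values from the paper.

```python
import numpy as np

# Illustrative constants: gradients, ADAM state, and hyper-parameters.
eta, beta1, beta2, eps, lam, t = 1e-3, 0.9, 0.999, 1e-8, 0.1, 5
gL, gR = 0.7, -0.3               # stand-ins for dL/dtheta and dR/dtheta
m_prev, n_prev, theta = 0.2, 0.05, 1.0

def adam_update(alpha):
    """One ADAM step u(theta, alpha) on the combined gradient alpha*gL + lam*gR."""
    grad = alpha * gL + lam * gR
    m = beta1 * m_prev + (1 - beta1) * grad
    n = beta2 * n_prev + (1 - beta2) * grad ** 2
    m_hat = m / (1 - beta1 ** t)
    n_hat = n / (1 - beta2 ** t)
    return theta - eta * m_hat / (np.sqrt(n_hat) + eps)

def hyper_grad(alpha):
    """Closed-form du/dalpha from the derivation above."""
    grad = alpha * gL + lam * gR
    m = beta1 * m_prev + (1 - beta1) * grad
    n = beta2 * n_prev + (1 - beta2) * grad ** 2
    m_hat = m / (1 - beta1 ** t)
    n_hat = n / (1 - beta2 ** t)
    term1 = (1 - beta1) * gL / ((1 - beta1 ** t) * (np.sqrt(n_hat) + eps))
    term2 = ((1 - beta2) * (alpha * gL ** 2 + lam * gL * gR) * m_hat
             / ((1 - beta2 ** t) * np.sqrt(n_hat) * (np.sqrt(n_hat) + eps) ** 2))
    return -eta * (term1 - term2)

alpha, h = 1.0, 1e-6
numeric = (adam_update(alpha + h) - adam_update(alpha - h)) / (2 * h)
analytic = hyper_grad(alpha)
```

Agreement between `numeric` and `analytic` confirms that the two terms of the closed form correctly account for the dependence of both m̂_t and n̂_t on α_i.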



Figure 1: Overview of our proposed method. The width and height dimensions of weight tensors are omitted. The architecture vector v is first generated from the fixed inputs a_i, i = 1, . . . , L. Then, a sub-network is sampled according to the architecture vector v. The parameters of HSN are updated using the gradients of the loss function obtained when evaluating the sub-network.

Figure 2: (a) Under the original FLOPs regularization, some architectures may become unreachable. (b) After layer-wise scaling, the potential search space of architectures for a sub-network is increased.

5. update Θ by the ADAM optimizer.
end
end
return HSN with the final Θ.

Figure 3: (a,b): Effect of λ on the performance of sub-networks. (c,d): Effect of layer-wise scaling on the performance of sub-networks. All experiments are done on CIFAR-10.

Figure 5: (a,b): Effect of different schemes for the inputs of HSN a_i. (c,d): Effect of different settings of HSN. (e,f): Effect of different learning rates with LWS. (g,h): Effect of different optimizers with LWS. For plots in (c,d,g,h), shaded areas represent the variance from 5 trials.

Figure 6: (a,b): Visualization of pruned architectures for ResNet-50 and MobileNetV2. (c,d): Mean and variance of 20 generated sub-networks for pruning.

Figure 7: (a,b): Performance of sub-networks when using HSN or not using HSN (the setting in Eq. 11). (c,d): Regularization loss for the same settings.

Figure 8: (a,b): Performance of sub-networks when training HSN given forward (c=0) and backward pruning (c=3). (c,d): Regularization loss of sub-networks when training HSN given forward (c=0) and backward pruning (c=3). All experiments are done on CIFAR-10.

Algorithm 1: Model Compression via Hyper-Structure Network
Input: dataset for training HSN: D_HSN; preserved rate of FLOPs: p; hyper-parameter: λ; training epochs: n_E; pre-trained CNN: f; learning rate β used when updating {α_i}_{i=1}^L.
Initialization: initialize Θ randomly; initialize α_i = 1, i = 1, . . . , L; freeze W in f.

Comparison results on the CIFAR-10 dataset with ResNet-56 and MobileNetV2. ∆-Acc represents the performance change before and after model pruning. +/- indicates an increase or decrease compared to the baseline results.

Comparison results on the ImageNet dataset with ResNet-34, ResNet-50, ResNet-101 and MobileNetV2. ∆-Acc represents the performance change before and after model pruning. +/- indicates an increase or decrease compared to the baseline results.

The structure of HSN used in our method.

The cost of the extra storage is trivial.

F CHOICE OF p GIVEN DIFFERENT DATASETS AND ARCHITECTURES

Choice of p for different models. p is the remaining FLOPs divided by the total FLOPs. In Tab. 4, we list the choices of p for the different models and datasets used in our experiments.

5. CONCLUSION

In this paper, we proposed a hyper-structure network (HSN) for model compression that captures inter-channel and inter-layer relationships. An architecture vector generated by HSN selects a sub-network from the original model. We then evaluated this sub-network with classification and resource losses, and HSN is updated with the gradients from both. Moreover, we identified a problem of the FLOPs constraint (its bias towards later layers), which limits the final search space of HSN. To solve it, we further proposed layer-wise scaling to balance the gradients. With these techniques, our method achieves state-of-the-art performance on ImageNet with four different architectures.

