MODEL COMPRESSION VIA HYPER-STRUCTURE NETWORK

Abstract

In this paper, we propose a novel channel pruning method for the compression and acceleration of Convolutional Neural Networks (CNNs). Previous channel pruning methods usually ignore the relationships between channels and layers; many of them parameterize each channel independently using gates or similar concepts. To fill this gap, we propose a hyper-structure network that generates the architecture of the main network. Like an ordinary hypernet, our hyper-structure network can be optimized by regular backpropagation. Moreover, we use a regularization term to specify the computational budget of the compact network. FLOPs is the usual criterion of computational cost; however, using FLOPs directly in the regularization may over-penalize early layers. To address this issue, we further introduce learnable layer-wise scaling factors to balance the gradients from different terms; these factors can be optimized by hyper-gradient descent. Extensive experimental results on CIFAR-10 and ImageNet show that our method is competitive with state-of-the-art methods.

1. INTRODUCTION

Convolutional Neural Networks (CNNs) have achieved great success in many machine learning and computer vision tasks (Krizhevsky et al., 2012; Redmon et al., 2016; Ren et al., 2015; Simonyan & Zisserman, 2014a; Bojarski et al., 2016). To handle real-world applications, CNN designs have recently become increasingly complex in terms of width, depth, etc. (Krizhevsky et al., 2012; Simonyan & Zisserman, 2014b; He et al., 2016; Huang et al., 2017). Although these complex CNNs attain better performance on benchmark tasks, their computational and storage costs increase dramatically. As a result, a typical CNN-based application can easily exhaust an embedded or mobile device, and can hardly be deployed on resource-limited platforms. To tackle these problems, many methods (Han et al., 2015b; a) have been devoted to compressing large CNNs into compact models. Among these methods, weight pruning and structural pruning are two popular directions. Unlike weight pruning or sparsification, structural pruning, especially channel pruning, is an effective way to reduce the computational cost of a model because it requires no post-processing steps to achieve actual acceleration and compression. Many existing works (Liu et al., 2017; Ye et al., 2018; Huang & Wang, 2018; Kim et al., 2020; You et al., 2019) approach structural pruning by applying gates or similar concepts to the channels of a layer. Although these ideas have achieved many successes in channel pruning, there are potential problems. Usually, each gate has its own parameter, but parameters of different gates have no dependence on one another. As a result, they can hardly learn inter-channel or inter-layer relationships. For the same reason, the slimmed models produced by these methods may overlook the information shared between channels and layers, potentially leading to sub-optimal compression results.
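The independent-gate formulation criticized above can be sketched in a few lines. The following is a hypothetical NumPy illustration (not any paper's actual implementation): each channel carries its own scalar gate, and the keep/drop decision for a channel never consults any other channel or layer.

```python
import numpy as np

# Hypothetical sketch of gate-based channel pruning with independent gates.
# Each channel gets one scalar parameter; the pruning decision for a channel
# is made in isolation, so no inter-channel or inter-layer relationship is
# modeled. All sizes and values below are illustrative.

rng = np.random.default_rng(0)

num_channels = 8
gates = rng.uniform(0.0, 1.0, size=num_channels)  # one scalar per channel

def prune_mask(gates, threshold=0.5):
    """Binary keep/drop decision made independently for every channel."""
    return (gates >= threshold).astype(np.float32)

feature_map = rng.standard_normal((num_channels, 4, 4))  # C x H x W
mask = prune_mask(gates)
gated = feature_map * mask[:, None, None]  # zero out pruned channels

print(int(mask.sum()), "of", num_channels, "channels kept")
```

Note that `prune_mask` looks only at a single channel's gate value, which is exactly the independence the hyper-structure network is meant to remove.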
To address these challenges, we propose a novel channel pruning method inspired by hypernet (Ha et al., 2016). A hypernet uses one network to generate the weights of another network, and can itself be optimized through backpropagation. We extend the hypernet to a hyper-structure network that generates an architecture vector for a CNN instead of weights. Each architecture vector corresponds to a sub-network of the main (original) network. By doing so, inter-channel and inter-layer relationships can be captured by our hyper-structure network. Besides the hyper-structure network, we also introduce a regularization term to control the computational budget of a sub-network. Recent model compression methods focus on pruning computational FLOPs instead of parameters. The problem with FLOPs regularization is that its gradients heavily penalize early layers, which can be regarded as a bias towards later layers. Such a bias restricts the potential search space of sub-networks. To let our hyper-structure network explore more possible structures, we further introduce layer-wise scaling factors that balance the gradients from different losses for each layer. These factors can be optimized by hyper-gradient descent. Our contributions are summarized as follows:
1) Inspired by hypernet, we propose to use a hyper-structure network for model compression to capture inter-channel and inter-layer relationships. Like a hypernet, the proposed hyper-structure network can be optimized by regular backpropagation.
2) Gradients from FLOPs regularization are biased toward later layers, which truncates the potential search space of sub-networks. To balance the gradients from different terms, layer-wise scaling factors are introduced for each layer. These scaling factors can be optimized through hyper-gradient descent with trivial additional costs.
3) Extensive experiments on CIFAR-10 and ImageNet show that our method can outperform both conventional channel pruning methods and AutoML based pruning methods on ResNet and MobileNetV2.
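The core idea in contribution 1) can be sketched as follows. This is a minimal NumPy illustration under assumed shapes (the embedding size, hidden width, and layer widths are invented, not taken from the paper): a single shared MLP maps a learned embedding to one soft architecture vector per layer of the main network, so all gates are produced by shared weights and are therefore coupled across channels and layers.

```python
import numpy as np

# Hypothetical sketch of a hyper-structure network: one shared MLP emits
# gate vectors for every layer of the main network. Because all gates come
# from the same weights W1, W2, inter-channel and inter-layer dependencies
# can be captured, unlike independent per-channel gates.

rng = np.random.default_rng(1)

layer_widths = [16, 32, 64]          # channels per main-network layer (assumed)
embed_dim, hidden_dim = 8, 32        # illustrative sizes
total_out = sum(layer_widths)

z = rng.standard_normal(embed_dim)                 # learned input embedding
W1 = rng.standard_normal((hidden_dim, embed_dim)) * 0.1
W2 = rng.standard_normal((total_out, hidden_dim)) * 0.1

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def hyper_structure(z):
    """Forward pass: embedding -> shared MLP -> per-layer gate vectors."""
    h = np.tanh(W1 @ z)
    gates = sigmoid(W2 @ h)          # soft gates in (0, 1), differentiable
    # split the flat output into one architecture vector per layer
    splits = np.cumsum(layer_widths)[:-1]
    return np.split(gates, splits)

arch = hyper_structure(z)
for width, g in zip(layer_widths, arch):
    print(width, g.shape)
```

Because the mapping is differentiable end to end, gradients from the task loss and from a budget regularizer can both flow back into `W1`, `W2`, and `z` by regular backpropagation, which is the property the paper relies on.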

2. RELATED WORK

2.1. MODEL COMPRESSION

Recently, model compression has drawn a lot of attention from the community. Among all model compression methods, weight pruning and structural pruning are two popular directions. Weight pruning eliminates redundant connections without assumptions on the structure of the weights. Weight pruning methods can achieve a very high compression rate, but they need specially designed sparse matrix libraries to achieve actual acceleration and compression. As one of the early works, Han et al. (2015b) propose to use L1 or L2 magnitude as the criterion to prune weights and connections. SNIP (Lee et al., 2019) updates the importance of each weight by using gradients from the loss function; weights with lower importance are pruned. The lottery ticket hypothesis (Frankle & Carbin, 2019) posits that high-performance sub-networks exist within the large network at initialization time; the sub-network is then retrained with the same initialization. In rethinking network pruning (Liu et al., 2019b), they challenge the typical model compression pipeline (training, pruning, fine-tuning) and argue that fine-tuning is not necessary; instead, they show that training the compressed model from scratch with random initialization can obtain better results. One of the early works in structural pruning (Li et al., 2017) uses the sum of the absolute values of kernel weights as the criterion for filter pruning. Instead of directly pruning filters based on magnitude, structural sparsity learning (Wen et al., 2016) prunes redundant structures with Group Lasso regularization. On top of structural sparsity, GrOWL regularization is applied to make similar structures share the same weights (Zhang et al., 2018). One problem with Group Lasso is that weights with small values can still be important, and it is difficult for structures under Group Lasso regularization to reach exactly zero. As a result, Louizos et al. (2018) propose explicit L0 regularization to make weights within structures exactly zero. Besides using the magnitude of structure weights as a criterion, other methods utilize the scaling factor of batchnorm to achieve structural pruning, since batchnorm (Ioffe & Szegedy, 2015) is widely used in recent neural network designs (He et al., 2016; Huang et al., 2017). A straightforward way to achieve channel pruning is to make the batchnorm scaling factors sparse (Liu et al., 2017): if the scaling factor of a channel falls below a certain threshold, the channel is removed. The scaling factor can also be regarded as the gate parameter of a channel; methods related to this concept include (Ye et al., 2018; Huang & Wang, 2018; Kim et al., 2020; You et al., 2019). Though it has achieved many successes in channel pruning, using gates cannot capture the relationships between channels and across layers. Besides using gates, collaborative channel pruning (Peng et al., 2019) prunes channels by using Taylor expansion. Our method is also related to Automatic Model Compression (AMC) (He et al., 2018b). In AMC, they use policy gradient to update the policy
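The batchnorm-scaling-factor approach described above can be illustrated concretely. This is a toy NumPy sketch in the spirit of sparsity on batchnorm scales (the gamma values, penalty strength, and threshold below are made up for the demo, not taken from any cited paper): an L1 penalty pushes the per-channel scale gamma toward zero, and channels whose |gamma| falls below a threshold are removed.

```python
import numpy as np

# Toy illustration of batchnorm-scaling-factor pruning. An L1 penalty on
# the per-channel batchnorm scale gamma encourages sparsity; channels whose
# |gamma| ends up below a threshold are pruned. All numbers are invented.

gamma = np.array([0.91, 0.02, 0.47, 0.004, 0.63, 0.01])  # per-channel BN scales
l1_strength = 1e-4

def l1_subgradient(gamma, lam=l1_strength):
    """Subgradient of lam * sum(|gamma|), added to the task-loss gradient."""
    return lam * np.sign(gamma)

def surviving_channels(gamma, threshold=0.05):
    """Indices of channels whose BN scale exceeds the pruning threshold."""
    return np.flatnonzero(np.abs(gamma) >= threshold)

keep = surviving_channels(gamma)
print("kept channels:", keep.tolist())  # -> kept channels: [0, 2, 4]
```

Here each channel's fate again depends only on its own gamma, which is precisely the per-channel independence that gate-style methods share and that the hyper-structure network is designed to overcome.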

