PRUNING WITH OUTPUT ERROR MINIMIZATION FOR PRODUCING EFFICIENT NEURAL NETWORKS

Anonymous authors
Paper under double-blind review

Abstract

Deep Neural Networks (DNNs) are dominant in the field of machine learning. However, because DNN models have large computational complexity, implementing them on resource-limited equipment is challenging. Techniques for compressing DNN models without degrading their accuracy are therefore desired. Pruning is one such technique: it removes redundant neurons (or channels). In this paper, we present Pruning with Output Error Minimization (POEM), a method that performs not only pruning but also reconstruction to compensate for the error caused by pruning. The strength of POEM lies in its reconstruction to minimize the output error of the activation function, whereas the previous methods minimize the error of the value before the activation function is applied. Experiments were conducted with well-known DNN models (VGG-16, ResNet-18, and MobileNet) and image recognition datasets (ImageNet, CUB-200-2011). The results show that POEM significantly outperforms the previous methods in maintaining the accuracy of the compressed models.

1. INTRODUCTION

Nowadays, Deep Neural Networks (DNNs) are dominant in the field of machine learning, and the demand for them is increasing in various applications. However, DNNs are known to be over-parameterized and to require large computational cost. This makes them computationally slow, power-consuming, and difficult to implement on resource-limited equipment. There is therefore a need for techniques that create efficient DNN models by compressing large models while maintaining accuracy. Pruning is one such technique: it removes redundant weights from trained DNN models. Pruning methods can be divided into two groups: unstructured pruning and structured pruning. The former removes individual weight parameters to make the weight tensor sparse. Since the shape of the weight tensor remains the same, the compressed model must be implemented with hardware and libraries that can perform calculations only on non-zero weights. The latter removes neurons (or channels) to make the shape of the weight tensor smaller, so the effect of compression can be obtained with general-purpose hardware and libraries. In this paper, we focus on structured pruning.

How well a pruned model maintains its accuracy depends on two factors. The first is compression ratio optimization, i.e., how many neurons are removed in each layer. The second is layer-wise optimization, i.e., which neurons to preserve in each layer. In recent years, there has been growing awareness that the value of pruning lies in searching for an efficient sub-architecture within a large redundant architecture. This follows from research showing that a DNN model with the pruned architecture, trained from scratch, can achieve at least as good accuracy as the pruned and fine-tuned model (Liu et al., 2019). For this reason, the recent trend is to focus on the compression ratio optimization problem.
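The distinction between unstructured and structured pruning described above can be made concrete with a small NumPy sketch; the layer size, magnitude threshold, and preserved indices are arbitrary illustrative choices, not taken from this paper:

```python
import numpy as np

# Toy fully connected layer: 4 inputs feeding 3 neurons (one row per neuron).
W = np.arange(12, dtype=float).reshape(3, 4)

# Unstructured pruning: zero out individual weights. The tensor shape is
# unchanged, so a speedup requires sparse-aware hardware or libraries.
mask = np.abs(W) > 5.0
W_sparse = W * mask

# Structured pruning: remove an entire neuron (row), shrinking the tensor.
# The smaller dense tensor runs faster on general-purpose hardware.
keep = [0, 2]              # preserve neurons 0 and 2, prune neuron 1
W_small = W[keep, :]

print(W_sparse.shape)      # (3, 4) -- same shape, sparse entries
print(W_small.shape)       # (2, 4) -- smaller shape
```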
However, does this mean that layer-wise optimization is no longer important? It is reasonable to expect that combining a compression ratio optimization method with a better layer-wise optimization method results in more effective pruning. Layer-wise optimization is therefore still important and worth investigating. In this paper, we propose a pruning method named Pruning with Output Error Minimization (POEM) that performs layer-wise optimization. The strength of POEM lies in its reconstruction using the Weighted Least Squares (WLS) method to minimize the output error of the activation function, whereas the previous methods (Luo et al., 2017; He et al., 2017; Dong et al., 2017; Kamma & Wada, 2021) minimize the error of the value before the activation function is applied. For example, since the ReLU function rounds a negative value to zero, the error on a negative element need not be compensated (unless it turns positive due to the error). POEM can perform reconstruction only for the positive elements, while the previous methods perform reconstruction for all elements, including negative ones. For this reason, POEM is superior to the previous methods in maintaining the accuracy of the pruned model. To the best of our knowledge, POEM is the first method to perform reconstruction based on the output error of the activation function. To verify POEM, we conducted experiments on ImageNet (Deng et al., 2009), a large-scale image classification dataset, with well-known DNN models for image classification: VGG-16 (Simonyan & Zisserman, 2015), ResNet-18 (He et al., 2016), and MobileNet (Howard et al., 2017). The results show that POEM reduces the output error of the activation function better than the previous methods (He et al., 2017; Kamma & Wada, 2021) and improves accuracy both before and after fine-tuning.
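The contrast above can be sketched for a single neuron: plain least squares (as in the previous methods) fits the pre-activation value on all samples, while a weighted variant in the spirit of POEM emphasizes the samples whose ReLU output is nonzero. This is only a toy sketch of the idea, not the authors' POEM algorithm; the layer sizes, preserved indices, and the 0.1 weight assigned to clipped samples are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 8))   # activations of 8 input neurons over 200 samples
w = rng.normal(size=8)          # original incoming weights of one neuron
keep = [0, 1, 2, 4, 6]          # indices of input neurons preserved by pruning

y_pre = X @ w                   # pre-activation target

# Plain least squares (previous methods): fit every element, negative or not.
a_ls, *_ = np.linalg.lstsq(X[:, keep], y_pre, rcond=None)

# Weighted least squares: emphasize samples whose pre-activation is positive,
# since errors on clipped (negative) values mostly do not pass through ReLU.
s = np.where(y_pre > 0, 1.0, 0.1)
a_wls, *_ = np.linalg.lstsq(X[:, keep] * s[:, None], s * y_pre, rcond=None)

relu = lambda v: np.maximum(v, 0.0)
err_ls = np.mean((relu(X[:, keep] @ a_ls) - relu(y_pre)) ** 2)
err_wls = np.mean((relu(X[:, keep] @ a_wls) - relu(y_pre)) ** 2)
```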
We also confirmed that the accuracy of the compressed model can be further improved by combining POEM with compression ratio optimization methods (He et al., 2018b; Kamma et al., 2022; Li et al., 2022). The rest of this paper is structured as follows. In Sec. 2, we introduce related work. In Sec. 3, we explain the proposed method. In Sec. 4, we present experimental results verifying the effectiveness of POEM. In Sec. 5, we conclude the paper.

2. RELATED WORKS

DNN compression methods can be divided into four groups: structured pruning, unstructured pruning (or sparsification), tensor decomposition, and quantization. Structured pruning makes the shape of the weight tensor smaller by removing redundant neurons (or channels) (Molchanov et al., 2017; He et al., 2018a; 2017; Luo et al., 2017; Kamma & Wada, 2021; Jiang et al., 2018). The advantage of these methods is that the effect of compression is obtained without special hardware or libraries. Unstructured pruning makes the weight tensor sparse without changing its shape (LeCun et al., 1990; Liu et al., 2015; Han et al., 2016; Lee et al., 2019); its effect is obtained by implementing the compressed model with hardware or libraries that perform computation only on the non-zero elements of the weight tensor. Methods based on tensor decomposition replace a large weight tensor with the product of multiple smaller weight tensors (Xue et al., 2013; Kim et al., 2019; Denton et al., 2014). These methods can effectively reduce the number of parameters and the computational complexity, although the compressed model gains extra layers, incurring additional computational overhead. Quantization reduces the memory and complexity requirements of a model by discretizing the weights (Courbariaux et al., 2015; Liu et al., 2022; Li et al., 2021; Wei et al., 2022); a quantized model must be implemented on equipment supporting low-bit computation. In this paper, we focus on structured pruning because of the following benefits: structured pruning compresses the model without incurring computational overhead, and the compressed model can be implemented without special hardware or libraries. For developing an effective pruning method, two problems should be addressed. One is the problem of compression ratio optimization, and the other is layer-wise optimization.
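Of the four groups, tensor decomposition admits a particularly compact illustration: a dense weight matrix can be approximated by the product of two thin matrices via truncated SVD, trading one layer for two smaller ones. The matrix size and rank below are arbitrary illustrative choices, not from any of the cited methods:

```python
import numpy as np

rng = np.random.default_rng(1)
W = rng.normal(size=(64, 64))   # a dense layer's weight matrix

# Truncated SVD: approximate W by the product of two thin matrices.
U, S, Vt = np.linalg.svd(W, full_matrices=False)
r = 8                           # illustrative rank
W1 = U[:, :r] * S[:r]           # 64 x 8
W2 = Vt[:r, :]                  # 8 x 64

# Parameters drop from 64*64 = 4096 to 2 * 64 * 8 = 1024, but the single
# layer becomes two sequential layers, i.e., extra computational overhead.
params_before = W.size
params_after = W1.size + W2.size
```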
Some pruning methods address both of these problems in a single framework, while others handle each problem separately. Compression ratio optimization configures the number of pruned neurons in each layer. AutoML Model Compression (AMC) uses reinforcement learning to tune compression ratios so that the accuracy of the model is maximized under a constraint on FLOPs (the number of floating-point multiplications), or the FLOPs are minimized under a constraint on accuracy (He et al., 2018b). Pruning Ratio Optimizer (PRO) tunes compression ratios by alternately performing layer selection and pruning based on the output error of the final layer (Kamma et al., 2022). RandomPruning performs random search in the search space of compression ratios (Li et al., 2022). These methods can be combined with any layer-wise optimization method. Layer-wise optimization selects which neurons to preserve in each layer. Many neuron selection criteria have been investigated, such as those based on the norm of outgoing weights.
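A norm-based selection criterion of this kind can be sketched in a few lines: score each neuron by the magnitude of its weight vector and preserve the highest-scoring ones. The layer size and number of preserved neurons are illustrative assumptions, not taken from the methods cited above:

```python
import numpy as np

rng = np.random.default_rng(2)
W = rng.normal(size=(6, 10))    # 6 neurons, each with 10 weights

# Norm-based layer-wise selection: keep the neurons whose weight
# vectors have the largest L2 norm.
n_keep = 3
norms = np.linalg.norm(W, axis=1)
keep = np.sort(np.argsort(norms)[-n_keep:])   # indices of preserved neurons
W_pruned = W[keep, :]
```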

