CNN COMPRESSION AND SEARCH USING SET TRANSFORMATIONS WITH WIDTH MODIFIERS ON NETWORK ARCHITECTURES

Abstract

We propose a new approach, based on discrete filter pruning, to adapt off-the-shelf models to embedded environments. Importantly, we circumvent the usually prohibitive costs of model compression. Our method, Structured Coarse Block Pruning (SCBP), prunes whole CNN kernels using width modifiers applied to a novel transformation of convlayers into superblocks. SCBP uses set representations to construct a rudimentary search that provides candidate networks. To test our approach, the original ResNet architectures serve as the baseline and also provide the 'seeds' for our candidate search. The search produces a configurable number of compressed (derived) models. These derived models are often 20% faster and 50% smaller than their unmodified counterparts. At the expense of some accuracy, model size can be reduced and inference latency lowered even further. The unique SCBP transformations yield many new model variants, each with its own trade-offs, and do not require GPU clusters or expert humans for training or design.

1. INTRODUCTION

Modern Computer Vision (CV) is dominated by the convolution operation introduced by Fukushima & Miyake (1982) and later advanced into a Convolutional Neural Network (CNN or convnet) by LeCun et al. (1989). Until recently, these convnets were limited to rudimentary CV tasks such as classifying handwritten digits LeCun et al. (1998). Present-day convnets have far surpassed other CV approaches by improving their framework to include faster activations Nair & Hinton (2010), stacked convolutional layers (convlayers) Krizhevsky et al. (2012), and better optimizers Kingma & Ba (2014). These multi-layer deep convnets require big data in the form of datasets such as ImageNet Deng et al. (2009) to enable deep learning LeCun et al. (2015) of the feature space.

However effective, convnets are held back by their high resource consumption. Utilizing an effective convnet on the edge presents new challenges in latency, energy, and memory costs Chen & Ran (2019). Additionally, many tasks, such as autonomous robotics, require real-time processing and cannot be offloaded to the cloud. As such, resource-constrained platforms, such as embedded systems, lack the compute and memory to use convnets in their default constructions.

Analysis of convnets reveals that they are overparameterized Denil et al. (2013) and that reducing this overparameterization can be a key mechanism in compressing convnets Hanson & Pratt (1988); LeCun et al. (1990); Han et al. (2015a). The many weights that form a network are not necessarily of the same entropy and can therefore be seen as scaffolding to be removed during a compression step Hassibi & Stork (1993); Han et al. (2015b); Tessier et al. (2021). In this work, our objective is to reduce the size of any given convnet using an automated approach requiring little human engineering and compute resources. To that end, we design Structured Coarse Block Pruning (SCBP), a compressing mechanism that requires no iterative retraining or fine-tuning.
SCBP uses a low-cost search method, seeded with an off-the-shelf network, to generate compressed model derivatives with unique accuracy, size, and latency trade-offs. The remainder of this paper is organized as follows. Section 2 focuses on closely related works. Section 3 details the methodology and implementation of SCBP. Section 4 discusses experimental findings, and finally we conclude with key takeaways and future directions in Section 5.
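To make the pruning idea above concrete, the following sketch removes whole filters from a convlayer's weight tensor. Note that SCBP assigns per-segment widths rather than ranking individual filters; the L1-norm criterion here is a common stand-in used only for illustration, and the tensor shape and ratio are assumed values.

```python
# Illustrative structured (whole-filter) pruning: score each conv filter by
# its L1 norm and keep only the highest-scoring fraction. The L1-norm
# criterion is an assumed stand-in, not SCBP's width-modifier scheme.
import numpy as np

def prune_filters(weights, keep_ratio):
    """weights: array of shape (out_channels, in_channels, kH, kW).
    Returns the sorted indices of the filters to keep."""
    scores = np.abs(weights).sum(axis=(1, 2, 3))    # one L1 score per filter
    n_keep = max(1, int(round(keep_ratio * len(scores))))
    kept = np.argsort(scores)[-n_keep:]             # highest-norm filters survive
    return np.sort(kept)

rng = np.random.default_rng(0)
w = rng.normal(size=(8, 3, 3, 3))          # toy convlayer with 8 filters
kept = prune_filters(w, keep_ratio=0.5)    # keeps 4 of the 8 filters
```

Dropping a filter also removes the corresponding input channel from the next convlayer, which is where the memory-traffic savings discussed above come from.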

2. RELATED WORKS

Early work on removing parameters from Artificial Neural Networks (ANNs) focused on gaining insight into the purpose of those parameters Hanson & Pratt (1988). In our proposed mechanism, we bridge a gap between manual and NAS approaches by using a low-cost search to order the network width attributes of any given CV model, which is partitioned by a novel algorithm into multiple segments, each assigned its own width modifier. A closely related work is MobileNets Howard et al. (2017), a family of networks that applies different uniform width modifiers to a manually engineered baseline model. Similarly, EfficientNets Tan & Le (2019) expands the idea of modifiers to include depth and input-resolution modifiers. Our approach benefits from generalized compression that can be applied to any model: because we do not require a newly engineered baseline, cost stays within 10^1 GPU hours.
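The difference between a MobileNets-style uniform width modifier and the per-segment modifiers used here can be sketched in a few lines; the segment widths and modifier values below are illustrative assumptions, not figures from the paper's experiments.

```python
# Sketch: uniform vs. per-segment width modifiers. Base widths and modifier
# values are assumed for illustration.

def scale_widths(base_widths, modifiers):
    """Scale each segment's channel count by its own width modifier,
    rounding to whole filters and keeping at least one per segment."""
    assert len(base_widths) == len(modifiers)
    return [max(1, round(w * m)) for w, m in zip(base_widths, modifiers)]

base = [64, 128, 256, 512]   # per-segment widths of a ResNet-style seed

# Uniform modifier (MobileNets-style): one ratio thins every segment equally.
uniform = scale_widths(base, [0.5] * 4)

# Per-segment modifiers: early segments keep more filters than late ones.
tapered = scale_widths(base, [1.0, 0.75, 0.5, 0.25])
```

A uniform 0.5 modifier yields [32, 64, 128, 256], while the tapered assignment yields [64, 96, 128, 128]; the search described later chooses among such per-segment assignments.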

Prior to AlexNet Krizhevsky et al. (2012), exploiting ANN overparameterization was used as a regularization mechanism LeCun et al. (1990); Hassibi & Stork (1993). Recently, ANN overparameterization has been exploited to reduce the size of models Han et al. (2015b); Zhou et al. (2017); Tessier et al. (2021). Removing parameters compresses the memory footprint of CV models, which can then allow their deployment on embedded systems. Compression additionally lowers energy costs and latency by greatly reducing memory traffic Han et al. (2015a); Zhou et al. (2017). Model accuracy is sustained or reduced, depending on the method of compression. Preserving a compressed CV model's baseline accuracy is challenging and requires large compute budgets Han et al. (2015b); Zhou et al. (2017). A common mechanism for maintaining a trained model's accuracy is to iteratively reduce its size in prune-retrain cycles. Another is to leverage a Neural Architecture Search (NAS), often using reinforcement learning, to build networks from scratch that are both small and accurate Zoph et al. (2018); Cai et al. (2020). However, both prune-retrain and NAS are exorbitant in compute usage, typically on the order of 10^3 and 10^5 GPU hours, respectively. When computing resources are limited, faster mechanisms for compression are needed. A range of techniques are available, such as tensor factorization Kim et al. (2015); Phan et al. (2020); Swaminathan et al. (2020) and Fast Fourier Transforms (FFTs) on CV models' weight tensors. Hashing can also be used to group similar weights into buckets Chen et al. (2015); Hu et al. (2018). These techniques, while faster to train, do not maintain the original network's accuracy and often produce larger models relative to prune-retrain and NAS approaches. Quantization is also frequently used Gong et al. (2014); Wu et al. (2016) to reduce the bit-width of parameters from 64-bit floats down to 5-bit integers or less Wu et al. (2016); Zhou et al. (2017). In special cases, 2-bit parameters are sufficient Rastegari et al. (2016); Courbariaux et al. (2016; 2015). Other techniques include those based on weight decay and entropy Luo & Wu (2017); Min et al. (2018); Tessier et al. (2021).

3. METHOD

The compression approach detailed below is realized by a novel combination of convlayer binning, width modification, and a unique search-train step based on set transforms of this combination. Unlike most network architecture search methods, which impose prohibitively long search and train times, our work circumvents the cost problem by providing a halfway point between NAS and human-engineered architectures. In doing so, we present a rudimentary proof of concept which, in our evaluations, produces an efficient search and generates derivative models when configured with simple, human-defined search-domain set constraints. The SCBP version we use stands on four foundations: (1) a seed architecture from which derivative architectures are produced; (2) a network segmentation mechanism that bins the seed architecture and assists in derivation; (3) a set of compression ratios (c-ratios) for each segment of the seed network; and (4) a one-shot search for network instantiation based on (1)-(3).
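The one-shot search over per-segment c-ratios described in Section 3 can be sketched as a Cartesian product of a c-ratio set over the seed network's segments. The particular c-ratio values and the three-segment split below are assumptions for illustration; the paper does not fix these constants here.

```python
# Sketch of SCBP-style candidate enumeration: each segment of the seed
# network is assigned one compression ratio (c-ratio) from a small set, and
# the candidate pool is the Cartesian product over segments. The c-ratio
# values and segment count are illustrative assumptions.
from itertools import product

C_RATIOS = (1.0, 0.75, 0.5)   # 1.0 leaves a segment at its seed width
NUM_SEGMENTS = 3              # e.g., a seed ResNet binned into three superblocks

def candidate_configs(c_ratios=C_RATIOS, num_segments=NUM_SEGMENTS):
    """Yield every assignment of one c-ratio to each segment."""
    yield from product(c_ratios, repeat=num_segments)

configs = list(candidate_configs())   # 3 ratios over 3 segments -> 27 candidates
```

Because the pool size is |c-ratios|^segments, keeping both sets small is what holds the search-train cost to the 10^1 GPU-hour range claimed above: each candidate is trained once, with no prune-retrain cycles.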

