IDP: ITERATIVE DIFFERENTIABLE PRUNING FOR NEURAL NETWORKS WITH PARAMETER-FREE ATTENTION

Abstract

Deep Neural Network (DNN) pruning is an effective method to reduce the size of a model, improve inference latency, and minimize power consumption on DNN accelerators, at the risk of decreased model accuracy. In this paper, we propose a novel differentiable pruning scheme, Iterative Differentiable Pruning (IDP), which offers state-of-the-art trade-offs among model size, accuracy, and training cost. IDP creates soft pruning masks based on fixed-point attention for a given sparsity target, achieving state-of-the-art trade-offs between model accuracy and inference compute with negligible training overhead. We evaluated IDP on various computer vision and natural language processing tasks and found that it delivers state-of-the-art results. For MobileNet-v1, IDP achieves 68.2% top-1 ImageNet1k accuracy at 86.6% sparsity, which is 2.3% higher than the latest state-of-the-art pruning algorithms. For ResNet18, IDP offers 69.5% top-1 ImageNet1k accuracy at 85.5% sparsity at the same training cost, which is 0.8% better than the state-of-the-art method. IDP also demonstrates over 83.1% accuracy on Multi-Genre Natural Language Inference at 90% sparsity for BERT, while the next best existing technique shows 81.5% accuracy.

1. INTRODUCTION

While advanced deep neural networks (DNNs) have exceeded human performance on many complex cognitive tasks (Silver et al., 2018), their deployment onto mobile/edge devices, such as watches or glasses, for enhanced user experience (i.e., reduced latency and improved privacy) remains challenging. Most such on-device systems are battery-powered and heavily resource-constrained, requiring DNNs to have very high power/compute/storage efficiency (Wang et al., 2019; Wu et al., 2018; Howard et al., 2017; Vasu et al., 2022; Wang et al., 2020b). Such efficiency can be accomplished by mixing and matching various techniques, such as designing efficient DNN architectures like MobileNet/MobileViT/MobileOne (Sandler et al., 2018; Mehta & Rastegari, 2022; Vasu et al., 2022), distilling a complex DNN into a smaller architecture (Polino et al., 2018), quantizing/compressing the weights of DNNs (Cho et al., 2022; Han et al., 2016; J. Lee, 2021; Park & Yoo, 2020; Li et al., 2019; Zhao et al., 2019), and pruning near-zero weights (Peste et al., 2021; Kusupati et al., 2020; Liu et al., 2021; Zhang et al., 2022; Sanh et al., 2020; Zafrir et al., 2021; Zhu & Gupta, 2018; Wortsman et al., 2019). Pruning is also known to be highly complementary to quantization/compression (Wang et al., 2020b) when optimizing a DNN model. Training a larger model and then compressing it by pruning has been shown to yield better model accuracy than training a smaller model from scratch (Li et al., 2020). However, pruning comes at the cost of degraded model accuracy, and the trade-off is not straightforward (Kusupati et al., 2020). Hence, a desirable pruning algorithm should achieve high accuracy and accelerate inference for various types of networks without significant training overhead in cost or complexity.
In this work, we propose a simple yet effective pruning technique, Iterative Differentiable Pruning (IDP), based on a parameter-free attention mechanism (Bahdanau et al., 2015; Xu et al., 2015) that satisfies all of the above criteria. Our attention approach makes the pruning mask differentiable and lets the training loss decide whether and how each weight is pruned. Such a loss-driven differentiable pruning mask thus captures the interactions among weights automatically, without expensive mechanisms (Liu et al., 2021). Moreover, IDP requires neither additional learning parameters (Zhang et al., 2022) nor complicated training flows (Peste et al., 2021), yet offers precise control over the target
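To make the idea of a loss-driven soft mask concrete, the sketch below builds a differentiable pruning mask from weight magnitudes using a sigmoid relaxation of a hard threshold. This is a minimal illustration, not IDP's exact formulation: the quantile-based threshold and the `temperature` knob are assumptions for clarity (the paper derives its mask from parameter-free attention), and numpy stands in for an autograd framework where gradients would flow through the mask.

```python
import numpy as np


def soft_prune_mask(weights, sparsity, temperature=0.01):
    """Differentiable soft pruning mask from weight magnitudes (illustrative).

    `temperature` is a hypothetical knob, not from the paper: smaller values
    push the sigmoid toward a hard 0/1 mask.
    """
    magnitude = np.abs(weights)
    # Pick the threshold so that `sparsity` fraction of weights fall below it.
    threshold = np.quantile(magnitude, sparsity)
    # Sigmoid relaxation of the hard step [|w| > threshold]; every term is
    # smooth in the weights, so in an autograd framework the training loss
    # could shape the mask directly.
    return 1.0 / (1.0 + np.exp(-(magnitude - threshold) / temperature))


rng = np.random.default_rng(0)
w = rng.normal(size=10_000)
mask = soft_prune_mask(w, sparsity=0.9)
# Roughly 90% of the soft mask values sit below 0.5, matching the target.
```

Applying `w * mask` during training keeps small-magnitude weights attenuated rather than hard-zeroed, so the loss can still revive a weight whose removal turns out to be harmful, which is the intuition behind letting the training loss drive the mask.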

