NETWORK PRUNING THAT MATTERS: A CASE STUDY ON RETRAINING VARIANTS

Abstract

Network pruning is an effective method to reduce the computational expense of over-parameterized neural networks for deployment on low-resource systems. Recent state-of-the-art techniques for retraining pruned networks, such as weight rewinding and learning rate rewinding, have been shown to outperform the traditional fine-tuning technique in recovering lost accuracy (Renda et al., 2020), but so far it is unclear what accounts for this performance. In this work, we conduct extensive experiments to verify and analyze the uncanny effectiveness of learning rate rewinding. We find that the reason behind its success is the use of a large learning rate. A similar phenomenon can be observed with other learning rate schedules that involve large learning rates, e.g., the 1-cycle schedule (Smith & Topin, 2019). By leveraging the right learning rate schedule in retraining, we demonstrate a counter-intuitive phenomenon: randomly pruned networks can achieve even better performance than methodically pruned networks (fine-tuned with the conventional approach). Our results emphasize the crucial role of the learning rate schedule in pruned-network retraining, a detail often overlooked by practitioners when implementing network pruning.

1. INTRODUCTION

Training neural networks is an everyday task in the era of deep learning and artificial intelligence. Generally speaking, given data availability, large and cumbersome networks are often preferred as they have more capacity to exhibit good generalization. In the literature, large networks are considered easier to train than small ones (Neyshabur et al., 2018; Arora et al., 2018; Novak et al., 2018; Brutzkus & Globerson, 2019). Thus, many breakthroughs in deep learning are strongly correlated with increasingly complex and over-parameterized networks. However, the use of large networks exacerbates the gap between research and practice, since real-world applications usually require running neural networks in low-resource environments for numerous purposes: reducing memory, latency, energy consumption, etc. To adapt those networks to resource-constrained devices, network pruning (LeCun et al., 1990; Han et al., 2015; Li et al., 2016) is often exploited to remove dispensable weights, filters, and other structures from neural networks. The goal of pruning is to reduce the overall computational cost and memory footprint without inducing a significant drop in the performance of the network. A common approach to mitigating the performance drop after pruning is retraining: we continue to train the pruned model for some additional epochs. In this paper, we are interested in approaches based on learning rate schedules to control the retraining. A well-known practice is fine-tuning, which trains the pruned model with a small fixed learning rate. More advanced learning rate schedules exist, which we generally refer to as retraining. The retraining step is a critical part of implementing network pruning, but it has been largely overlooked and tends to vary across implementations, with differences in learning rate schedules, retraining budgets, hyperparameter choices, etc. Recently, Renda et al.
(2020) proposed a state-of-the-art technique for retraining pruned networks, namely learning rate rewinding (LRW). Specifically, instead of fine-tuning the pruned network with a fixed learning rate, usually the last learning rate from the original training schedule (Han et al., 2015; Liu et al., 2019), the authors suggested reusing the learning rate schedule from the previous t epochs (i.e., rewinding). This seemingly subtle change in learning rate schedule led to an important result: LRW was shown to achieve performance comparable to more complex and computationally expensive pruning algorithms while only utilizing simple norm-based pruning. Unfortunately, the authors did not provide an analysis to justify the improvement. In general, it is intriguing to understand the importance of a learning rate schedule and how it affects the final performance of a pruned model. In this work, we study the behavior of pruned networks under different retraining settings. We find that the efficacy of retraining with learning rate rewinding is rooted in the use of a large learning rate, which helps pruned networks converge faster after pruning. We demonstrate that the success of learning rate rewinding over fine-tuning is not exclusive to the learning rate schedule coupled with the original training process. Retraining with a large learning rate can outperform fine-tuning even with a modest retraining budget, e.g., a few epochs, and regardless of the network compression ratio. We argue that retraining is of paramount importance to regaining performance in network pruning and should not be overlooked when comparing two pruning algorithms.
This is evidenced by our extensive experiments: (1) a randomly pruned network can outperform a methodically pruned network with only (hyperparameter-free) modifications of the learning rate schedule in retraining, and (2) a simple baseline such as norm-based pruning can perform as well as other, more complex pruning methods by using a retraining schedule that restarts at a large learning rate. The contributions of our work are as follows.
• We document thorough experiments on learning rate schedules for the retraining step in network pruning with different pruning configurations;
• We show that the learning rate matters: pruned models retrained with a large learning rate consistently outperform those retrained by conventional fine-tuning, regardless of the specific learning rate schedule;
• We present a novel and counter-intuitive result achieved solely by retraining with a large learning rate: a randomly pruned network and a simple norm-based pruned network can perform as well as networks obtained from more sophisticated pruning algorithms.
Given the significant impact of the learning rate schedule in network pruning, we advocate the following practices: the learning rate schedule should be considered a critical part of retraining when designing pruning algorithms, and rigorous ablation studies with different retraining settings should be conducted for a fair comparison of pruning algorithms. To facilitate reproducibility, we will release our implementation upon publication.
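As a concrete illustration, the difference between fine-tuning and learning rate rewinding can be sketched as follows. The specific schedule below (90 epochs, base learning rate 0.1, 10x step decay at epochs 30 and 60) is an illustrative assumption, not the exact setting of any cited work:

```python
# Sketch contrasting fine-tuning with learning rate rewinding (LRW).
# The step-decay schedule is a hypothetical example, not the exact
# configuration used by Renda et al. (2020).

def original_schedule(epoch, base_lr=0.1):
    """Step-decay schedule used to train the dense network (assumed)."""
    if epoch < 30:
        return base_lr
    elif epoch < 60:
        return base_lr * 0.1
    else:
        return base_lr * 0.01

def fine_tuning_lr(retrain_epoch, total_epochs=90):
    """Fine-tuning: retrain at the final (small) learning rate."""
    return original_schedule(total_epochs - 1)

def rewound_lr(retrain_epoch, t=60, total_epochs=90):
    """LRW: replay the schedule of the last t epochs of the original
    training run, so a large enough t revisits large learning rates."""
    return original_schedule(total_epochs - t + retrain_epoch)

# Fine-tuning stays at the small final rate, while rewinding the last
# 60 epochs restarts retraining at a 10x larger rate before decaying.
```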

2. PRELIMINARY AND METHODOLOGY

Pruning is a common method to produce compact, high-performance neural networks from their original large and cumbersome counterparts. We can categorize pruning approaches into three classes. Pruning after training consists of three steps: training the original network to convergence, pruning redundant weights based on some criterion, and retraining the pruned model to regain the performance lost to pruning (Li et al., 2016; Han et al., 2015; Luo et al., 2017; Ye et al., 2018; Wen et al., 2016; He et al., 2017). Pruning during training updates the "pruning mask" while training the network from scratch, thus allowing pruned neurons to be recovered (Zhu & Gupta, 2017; Kusupati et al., 2020; Wortsman et al., 2019; Lin et al., 2020b; He et al., 2019; 2018). Pruning before training, inspired by the Lottery Ticket Hypothesis (Frankle & Carbin, 2019), covers recent works that try to find the sparsity mask at initialization and train the pruned network from scratch without changing the mask (Lee et al., 2019; Tanaka et al., 2020; Wang et al., 2020). In this work, we are mainly concerned with the first category, i.e., pruning after training, which has the largest body of work to our knowledge. Traditionally, the last step is referred to as fine-tuning: continuing to train the pruned model with the small learning rate obtained from the last epoch of the original training. This seemingly subtle step is often overlooked when designing pruning algorithms. In particular, we found that the implementations of previous pruning algorithms have many notable differences in their retraining steps: some employed a small learning rate (e.g., 0.001 on ImageNet) to fine-tune the network (Molchanov et al., 2016; Li et al., 2016; Han et al., 2015) for a small number of epochs, e.g., 20 epochs in the work by Li et al.
(2016); some used a larger learning rate (0.01) with much longer retraining budgets, e.g., 60, 100, and 120 epochs on ImageNet, respectively (Zhuang et al., 2018; Gao et al., 2020; Li et al., 2020); You et al. (2019) and Li et al. (2020) respectively utilized the 1-cycle (Smith & Topin, 2019) and cosine annealing learning rate schedules.
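The two schedules just mentioned can be sketched in a few lines. The maximum learning rate, budget, and warmup fraction below are illustrative assumptions, not the settings used in the cited works:

```python
import math

# Minimal sketches of the 1-cycle and cosine annealing schedules.
# All hyperparameter values here are illustrative assumptions.

def cosine_annealing_lr(epoch, max_lr=0.1, total_epochs=100, min_lr=0.0):
    """Cosine annealing: smoothly decay from max_lr down to min_lr
    over the retraining budget."""
    return min_lr + 0.5 * (max_lr - min_lr) * (
        1.0 + math.cos(math.pi * epoch / total_epochs))

def one_cycle_lr(epoch, max_lr=0.1, total_epochs=100, warmup_frac=0.3):
    """1-cycle (Smith & Topin, 2019): linear warmup to max_lr, then
    linear decay -- retraining spends many epochs at large rates."""
    warmup = warmup_frac * total_epochs
    if epoch < warmup:
        return max_lr * epoch / warmup
    return max_lr * (1.0 - (epoch - warmup) / (total_epochs - warmup))
```

Both schedules visit large learning rates early in retraining, in contrast to fine-tuning at a small constant rate, which is the common thread this paper identifies.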

