A UNIFIED FRAMEWORK FOR SOFT THRESHOLD PRUNING

Abstract

Soft threshold pruning is among the cutting-edge pruning methods with state-of-the-art performance.¹ However, previous methods either perform aimless searching of the threshold scheduler or simply set the threshold trainable, lacking a theoretical explanation from a unified perspective. In this work, we reformulate soft threshold pruning as an implicit optimization problem solved using the Iterative Shrinkage-Thresholding Algorithm (ISTA), a classic method from the fields of sparse recovery and compressed sensing. Under this theoretical framework, all threshold tuning strategies proposed in previous studies of soft threshold pruning turn out to be different styles of tuning the L1-regularization term. We further derive an optimal threshold scheduler through an in-depth study of threshold scheduling based on our framework. This scheduler keeps the L1-regularization coefficient stable, implying a time-invariant objective function from the perspective of optimization. In principle, the derived pruning algorithm can sparsify any mathematical model trained via SGD. We conduct extensive experiments and verify its state-of-the-art performance on both Artificial Neural Networks (ResNet-50 and MobileNet-V1) and Spiking Neural Networks (SEW ResNet-18) on the ImageNet dataset. On the basis of this framework, we derive a family of pruning methods, including sparsify-during-training, early pruning, and pruning at initialization. The code is available at https://github.com/Yanqi-Chen/LATS.

1. INTRODUCTION

Pruning has been a thriving area of network compression. Since the day deep neural networks stretched their tentacles into every corner of machine learning applications, the demand for shrinking the size of network parameters has never stopped growing. Fewer parameters usually imply less computing burden on resource-constrained hardware such as embedded devices or neuromorphic chips. Some pioneering studies have revealed considerable redundancies in both Artificial Neural Networks (ANNs) (Han et al., 2015; 2016; Wen et al., 2016; Liu et al., 2017) and Spiking Neural Networks (SNNs) (Qi et al., 2018; Chen et al., 2021; Yin et al., 2021; Deng et al., 2021; Kundu et al., 2021; Kim et al., 2022b). In essence, pruning can be formulated as an optimization problem under a constraint on the L0 norm, the number of nonzero components in the network parameters. Assuming L is the loss function of the vectorized network weight w, we expect a lower L0 norm ∥w∥₀ along with a lower loss L(w). Despite different formulations, such as the hard constraints

    min_{L(w)≤c} ∥w∥₀   or   min_{∥w∥₀≤K} L(w),    (1)

or the soft (penalized) constraint

    min_w { L(w) + µ∥w∥₀ },

all these forms are without exception NP-hard (Natarajan, 1995; Davis et al., 1997; Nguyen et al., 2019). Relaxing the L0 norm to an Lp (0 < p < 1) norm does not make the problem more tractable, for it is still strongly NP-hard (Ge et al., 2011). Nowadays, research on pruning and sparse optimization is mainly focused on the L1-regularized problem, the tightest convex relaxation of the L0 norm, which dates back to a series of groundbreaking studies on compressed sensing (Donoho, 2006; Candès et al., 2006). These works technically allow us to solve the L1-regularized problem as an alternative, or sometimes even an equivalent option (Candès, 2008), to the L0 norm constraint. A variety of modern methods such as magnitude-based pruning are still firmly rooted in solving the L1-regularized optimization problem.
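To make the combinatorial nature of the penalized L0 form concrete, the following toy sketch solves min_w {L(w) + µ∥w∥₀} exactly by enumerating all supports. The separable quadratic loss and the value of µ are illustrative assumptions, not from the paper; the enumeration is tractable only for a handful of weights, which is precisely the intractability that motivates the L1 relaxation.

```python
import numpy as np
from itertools import combinations

# Illustrative separable loss L(w) = ||w - w_star||^2 with a dense target.
w_star = np.array([3.0, -0.5, 0.05, 2.0])
mu = 1.0  # L0 penalty coefficient (illustrative)

def penalized_objective(support):
    # On a fixed support, the optimal w copies w_star there and is zero elsewhere.
    w = np.zeros_like(w_star)
    w[list(support)] = w_star[list(support)]
    return np.sum((w - w_star) ** 2) + mu * len(support)

# Brute force over all 2^n supports -- exponential in n, hence NP-hard in general.
best = min(penalized_objective(s)
           for k in range(len(w_star) + 1)
           for s in combinations(range(len(w_star)), k))
```

For this instance the minimizer keeps exactly the components whose squared magnitude exceeds µ (indices 0 and 3), discarding the small entries.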
Be that as it may, L1 regularization is mostly employed for shrinking the magnitude of weights before the hard thresholding step, and it has started to be replaced by other sorts of novel regularization (Zhuang et al., 2020). In the past few years, a new range of pruning methods based on soft threshold reparameterization of weights has been developing gradually. The term "reparameterization" here refers to a specific mapping to network weights w from a latent space of hidden parameters θ. The "geometry" of the latent space can be designed to guide the actual weights w towards sparsity. In soft threshold pruning, the mapping is an element-wise soft threshold function with a time-variant threshold. Among these studies, two representative ones are Soft Threshold weight Reparameterization (STR) (Kusupati et al., 2020) and State Transition of Dendritic Spines (STDS) (Chen et al., 2022). Both achieved the best performance of their time. STDS further demonstrates the analogy between the soft threshold mapping and a structure in biological neural systems, i.e., dendritic filopodia and mature dendritic spines. However, few researchers have noticed that the soft threshold mapping also appears as the shrinkage operator in the solution of LASSO (Tibshirani, 1996) when the design matrix is orthonormal. Studies on LASSO further derived the Iterative Shrinkage-Thresholding Algorithm (ISTA) (Daubechies et al., 2004; Elad, 2006), which was popularized in sparse recovery and compressed sensing. ISTA has many variants (Bioucas-Dias & Figueiredo, 2007; Beck & Teboulle, 2009b; Bayram & Selesnick, 2010) and has long been certified as an effective sparsification method in all sorts of fields, such as deep learning (He et al., 2017; Zhang et al., 2018; Bai et al., 2020), computer vision (Beck & Teboulle, 2009a; Dong et al., 2013), medical imaging (Lustig et al., 2007; Otazo et al., 2015), and geophysics (Herrmann & Hennenfent, 2008).
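As a reference point for the discussion above, a minimal ISTA iteration for the LASSO objective ½∥Ax − b∥² + λ∥x∥₁ can be sketched as follows; the step size, λ, and problem sizes are illustrative assumptions.

```python
import numpy as np

def soft_threshold(x, lam):
    """Shrinkage operator: the proximal map of lam * ||.||_1,
    identical in form to the soft threshold mapping used in pruning."""
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

def ista(A, b, lam, step, iters=1000):
    """Minimize 0.5*||Ax - b||^2 + lam*||x||_1 by iterative shrinkage-thresholding."""
    x = np.zeros(A.shape[1])
    for _ in range(iters):
        # Gradient step on the smooth quadratic part, then shrinkage on the L1 part.
        x = soft_threshold(x - step * A.T @ (A @ x - b), step * lam)
    return x
```

A valid step size is 1/L with L the largest eigenvalue of AᵀA. Notably, when A is orthonormal and step = 1, a single iteration already returns the closed-form LASSO solution soft_threshold(Aᵀb, λ) mentioned above.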
Despite an abecedarian analysis of the similarity between STDS and ISTA, many issues remain to be addressed, such as 1) the exact equivalence between ISTA and the growing threshold in soft threshold pruning, 2) the necessity of setting the threshold trainable in STR, and 3) how to improve existing methods without exhaustively trying different tricks for scheduling the threshold. In this work, we propose a theoretical framework serving as a bridge between the underlying L1-regularized optimization problem and threshold scheduling. The bridge is built upon the key finding that soft threshold pruning is an implicit ISTA for nonzero weights. Specifically, we prove that the L1 coefficient in the underlying optimization problem is determined by both the threshold and the learning rate. In this way, any threshold tuning strategy can now be interpreted as a scheme for tuning the L1 penalty. We find that a time-invariant L1 coefficient leads to performance towering over previous pruning studies. Moreover, we bring a strategy of tuning the L1 penalty called the continuation strategy (Xiao & Zhang, 2012), once all the rage in the field of sparse optimization, to the field of pruning. It yields broad categories of algorithms covering several tracks in the present taxonomy of pruning. In brief, our contributions are summarized as follows:

• Theoretical cornerstone of threshold tuning strategies. To the best of our knowledge, this is the first work interpreting an increasing threshold as an ever-changing regularization term. Through theoretical analysis, we present a unified framework for the local equivalence of ISTA and soft threshold pruning. It enables a comprehensive study of threshold tuning using classic methods from sparse optimization.

• Learning rate adapted threshold scheduler. Through our proposed framework, we reveal the strong relation between the learning rate scheduler and the threshold scheduler.
We then show that a time-invariant L1 coefficient requires the change of the threshold to be proportional to the learning rate. The Learning rate Adapted Threshold Scheduler (LATS) built upon this principle achieves a state-of-the-art performance-sparsity tradeoff on both deep ANNs and SNNs.

• Sibling schedulers cover multiple tracks of pruning. We propose an early pruning algorithm by translating the homotopy continuation algorithm into a pruning algorithm with
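The core of LATS — growing the threshold by a fixed multiple of the current learning rate so the implied L1 coefficient stays constant — can be sketched as follows. This is a simplified illustration, not the authors' exact implementation: the function names, the constant alpha, and the masked gradient (no gradient flows through the dead zone of the soft threshold) are assumptions made for the sketch.

```python
import numpy as np

def soft_threshold(theta, d):
    # Element-wise soft threshold reparameterization: w = sign(theta)*max(|theta|-d, 0)
    return np.sign(theta) * np.maximum(np.abs(theta) - d, 0.0)

def train_with_lats(theta, grad_fn, lrs, alpha):
    """Sketch of soft threshold pruning with a learning-rate-adapted threshold.

    Growing the threshold by alpha * lr at every step keeps the implied L1
    coefficient (threshold increment / learning rate) fixed at alpha, i.e. a
    time-invariant objective from the optimization perspective.
    """
    theta = theta.astype(float).copy()
    d = 0.0
    for lr in lrs:
        w = soft_threshold(theta, d)      # reparameterized weights used in the forward pass
        mask = np.abs(theta) > d          # gradient only flows outside the dead zone
        theta -= lr * grad_fn(w) * mask   # SGD step on the latent parameters
        d += alpha * lr                   # threshold grows in lockstep with the learning rate
    return soft_threshold(theta, d), d
```

On an illustrative separable quadratic loss L(w) = ½∥w − w*∥², the iterates behave like ISTA with constant L1 coefficient alpha: large weights converge to their targets shrunk by alpha, while small weights are pruned exactly to zero.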



¹ For example, STR (Kusupati et al., 2020) is the first method to achieve >50% Top-1 accuracy on ImageNet with ResNet-50 under >99% sparsity. STDS (Chen et al., 2022) is the first pruning algorithm achieving acceptable performance degradation (∼3% under 88.8% sparsity) for spiking neural networks with 18+ layers.

