COMPRESSION-AWARE TRAINING OF NEURAL NETWORKS USING FRANK-WOLFE

Abstract

Many existing Neural Network pruning approaches either rely on retraining to compensate for pruning-induced performance degradation or induce strong biases throughout training to converge to a specific sparse solution. A third paradigm, 'compression-aware' training, obtains state-of-the-art dense models that are robust to a wide range of compression ratios using a single dense training run while also avoiding retraining. In that vein, we propose a constrained optimization framework centered around a versatile family of norm constraints and the Stochastic Frank-Wolfe (SFW) algorithm, which together encourage convergence to well-performing solutions while inducing robustness towards convolutional filter pruning and low-rank matrix decomposition. Comparing our novel approaches to compression methods in these domains on benchmark image-classification architectures and datasets, we find that our proposed scheme yields competitive results, often outperforming existing compression-aware approaches. In the case of low-rank matrix decomposition, our approach can require substantially fewer computational resources than approaches based on nuclear-norm regularization, since it needs only a fraction of the singular values in each iteration. As a special case, our proposed constraints can be extended to include the unstructured sparsity-inducing constraint proposed by Pokutta et al. (2020) and Miao et al. (2022), which we improve upon. Our findings also indicate that the robustness of SFW-trained models largely depends on the gradient rescaling of the learning rate, and we establish a theoretical foundation for that practice.
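To make the abstract's central ingredients concrete, the following is a minimal NumPy sketch of a single Stochastic Frank-Wolfe step over an L1-ball constraint, including a gradient-rescaled learning rate. The function names (`l1_lmo`, `sfw_step`) and the particular rescaling rule shown are our illustration under common conventions, not code from the paper; the paper's actual constraint family is more general than the plain L1 ball used here.

```python
import numpy as np

def l1_lmo(grad, radius):
    """Linear minimization oracle (LMO) over the L1 ball:
    returns argmin_{||v||_1 <= radius} <grad, v>, which is a signed
    scaled unit vector at the coordinate of largest gradient magnitude."""
    v = np.zeros_like(grad)
    i = np.argmax(np.abs(grad))
    v[i] = -radius * np.sign(grad[i])
    return v

def sfw_step(x, grad, radius, lr, rescale=True):
    """One SFW update x <- x + eta * (v - x), which stays feasible
    because it is a convex combination of two feasible points."""
    v = l1_lmo(grad, radius)
    d = v - x
    eta = lr
    if rescale:
        # Gradient rescaling of the learning rate (one common variant:
        # scale lr by ||grad|| / ||d||, clipped to 1 to preserve feasibility).
        eta = min(lr * np.linalg.norm(grad) / (np.linalg.norm(d) + 1e-12), 1.0)
    return x + eta * d
```

In practice `grad` would be a stochastic mini-batch gradient of the training loss; clipping the rescaled step size at 1 keeps the iterate inside the constraint set.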

1. INTRODUCTION

The astonishing success of Deep Neural Networks relies heavily on over-parameterized architectures (Zhang et al., 2016a) containing up to several billions of parameters. Consequently, modern networks require large amounts of storage and increasingly long, computationally intensive training and inference times, entailing tremendous financial and environmental costs (Strubell et al., 2019). To address this, a large body of work focuses on compressing networks, resulting in sparse models that require only a fraction of the memory or floating-point operations while being as performant as their dense counterparts. Recent techniques include the pruning of individual parameters (LeCun et al., 1989; Hassibi & Stork, 1993; Han et al., 2015; Gale et al., 2019; Lin et al., 2020; Blalock et al., 2020) or of group entities such as convolutional filters and entire neurons (Li et al., 2016; Alvarez & Salzmann, 2016; Liu et al., 2018; Yuan et al., 2021), the use of low-bit representations of networks (quantization) (Wang et al., 2018; Kim et al., 2020), as well as classical matrix or tensor decompositions (Zhang et al., 2016b; Tai et al., 2015; Xu et al., 2020; Liebenwein et al., 2021) to reduce the number of parameters. While there is evidence that pruning can benefit a model's ability to generalize (Blalock et al., 2020; Hoefler et al., 2021), a higher degree of sparsification will typically degrade the predictive power of the network. To reduce this impact, two main paradigms have emerged. Pruning after training, most prominently exemplified by Iterative Magnitude Pruning (IMP) (Han et al., 2015), forms a class of algorithms characterized by a three-stage pipeline: regular (sparsity-agnostic) training followed by prune-retrain cycles that are performed either once (One-Shot) or iteratively.
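The pruning step of the IMP-style pipeline described above can be sketched in a few lines: rank weights by absolute magnitude and zero out the smallest fraction. This is a minimal NumPy illustration of one-shot magnitude pruning; the function name and the tie-breaking via a threshold are our choices, not the paper's implementation.

```python
import numpy as np

def magnitude_prune(weights, sparsity):
    """Return a copy of `weights` with the smallest-magnitude
    fraction `sparsity` of entries set to zero."""
    flat = np.abs(weights).ravel()
    k = int(sparsity * flat.size)
    if k == 0:
        return weights.copy()
    # k-th smallest magnitude serves as the pruning threshold
    threshold = np.partition(flat, k - 1)[k - 1]
    mask = np.abs(weights) > threshold
    return weights * mask
```

In the full pipeline this step would be applied to each layer of a trained network, followed by retraining of the surviving weights (once for One-Shot, or repeatedly at increasing sparsity for the iterative variant).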
The need for retraining to recover pruning-induced losses is often considered an inherent disadvantage and computationally impractical (Liu et al., 2020; Ding et al., 2019; Wortsman et al., 2019; Lin et al., 2020). In that vein, pruning during training or regularization approaches avoid

