CRAM: A COMPRESSION-AWARE MINIMIZER

Abstract

Deep neural networks (DNNs) often have to be compressed, via pruning and/or quantization, before they can be deployed in practical settings. In this work we propose a new compression-aware minimizer dubbed CrAM that modifies the optimization step in a principled way, in order to produce models whose local loss behavior is stable under compression operations such as pruning. Thus, dense models trained via CrAM should be compressible post-training, in a single step, without significant accuracy loss. Experimental results on standard benchmarks, such as residual networks for ImageNet classification and BERT models for language modelling, show that CrAM produces dense models that can be more accurate than the standard SGD/Adam-based baselines, but which are stable under weight pruning: specifically, we can prune models in one-shot to 70-80% sparsity with almost no accuracy loss, and to 90% with reasonable (∼ 1%) accuracy loss, which is competitive with gradual compression methods. Additionally, CrAM can produce sparse models which perform well for transfer learning, and it also works for semi-structured 2:4 pruning patterns supported by GPU hardware.

1. INTRODUCTION

The massive recent progress of deep learning models has been accompanied by an increase in computational costs (Thompson et al., 2020). In turn, this has led to significant interest in model compression techniques to reduce these costs. For many existing models, compression techniques such as distillation (Hinton et al., 2015), pruning (Hoefler et al., 2021), and quantization (Gholami et al., 2021) can usually reduce the number of parameters or FLOPs of a given model by up to an order of magnitude with relatively little accuracy loss. However, performant compression still usually requires re-training or fine-tuning the model separately for each compression target, provided by the user as a target sparsity and/or quantization level. This compression process can be cumbersome and error-prone, as it requires additional computation and hyper-parameter tuning for each run.

In this work, we propose Compression-Aware Minimization (CrAM), a method for training neural networks that results in models which are easily compressible in one shot, while still being highly accurate. Specifically, CrAM enables training a single (dense) model which can later be compressed to different target levels, with minimal or no recalibration. Such flexibility is desirable, as models can be trained once and then deployed on multiple devices with different specifications. Having a single model that can be easily configured to meet the computational requirements of a specific device can both reduce the overall computational cost and allow easier customization to individual devices.

CrAM is loosely inspired by the recently introduced sharpness-aware minimizer (SAM) (Foret et al., 2021), which trains models that potentially converge to flatter minima, leading to better generalization compared to SGD-type baselines, by biasing the optimization process towards minima of uniformly low loss.
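As a point of reference, the SAM ascent-then-descend step can be sketched on a toy objective as follows. This is a minimal illustration on a quadratic loss; the loss, the hyper-parameter values, and the pure-NumPy setting are our own simplifying assumptions, not settings from the experiments.

```python
import numpy as np

def sam_step(w, grad_fn, lr=0.1, rho=0.05):
    """One SAM update: ascend to an approximate local worst case inside
    an L2 ball of radius rho, then descend using the gradient taken at
    that perturbed point. In a real network, evaluating grad_fn twice
    corresponds to the extra forward-backward pass SAM requires."""
    g = grad_fn(w)
    eps = rho * g / (np.linalg.norm(g) + 1e-12)  # normalized ascent direction
    g_adv = grad_fn(w + eps)                     # gradient at the perturbed point
    return w - lr * g_adv

# toy quadratic loss L(w) = 0.5 * ||w - t||^2, with gradient w - t
t = np.array([1.0, -2.0, 0.5])
w = np.zeros(3)
for _ in range(200):
    w = sam_step(w, lambda v: v - t)
```

On this convex toy problem the iterate settles in a small neighborhood of the minimizer (of radius on the order of `lr * rho`), since the adversarial perturbation keeps pushing the gradient evaluation slightly away from the optimum.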
Multiple subsequent works have investigated and improved upon the original SAM algorithm, either by obtaining better generalization (Kwon et al., 2021), or by reducing the computational costs of SAM training (Liu et al., 2020; Du et al., 2022a). We are the first to carry this idea over to the task of obtaining compressible models. Roughly speaking, CrAM works by optimizing not over the original "dense" model, but over a compression projection applied to the intermediate model iterate, at every optimization step. Thus, the CrAM update biases optimization towards iterates that both have low loss and are robust under one-shot compression. Similarly to SAM, CrAM is simple to implement as part of a regular training loop, and it has a single scaling hyper-parameter, for which we provide a well-performing default value. We detail the CrAM algorithm and provide a theoretical motivation leveraging fundamental results in robust optimization (Danskin, 2012) in Section 3.

To complement our algorithmic contribution, we perform an extensive experimental analysis of CrAM. We mainly focus on compression via weight pruning, but we also show that CrAM is compatible with weight quantization. Generally, CrAM models trained on large-scale image classification or language modelling tasks can improve over the dense baseline performance, while being very robust to one-shot pruning at different sparsity levels. For image classification, CrAM can train a highly accurate dense ResNet50 model on ImageNet that can be pruned in one shot to 80% and 90% sparsity and, following an inexpensive Batch Normalization re-tuning step on a small calibration set, is competitive in accuracy with state-of-the-art gradual pruning methods. Moreover, we show that full CrAM training is not necessary for good performance: a short CrAM finetuning period is sufficient to substantially improve one-shot pruning accuracy.
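As a rough illustration of the update just described, the following sketch applies magnitude pruning as the compression projection inside a SAM-like ascent-then-descend step. The toy quadratic loss, the hyper-parameter values, and the exact form of the update are simplifying assumptions on our part, not the precise CrAM formulation given in Section 3.

```python
import numpy as np

def topk_magnitude_prune(w, sparsity):
    """Compression projection C: zero out the smallest-magnitude weights."""
    k = int(sparsity * w.size)
    pruned = w.copy()
    if k > 0:
        pruned[np.argsort(np.abs(w))[:k]] = 0.0
    return pruned

def cram_step(w, grad_fn, lr=0.1, rho=0.05, sparsity=0.5):
    """One CrAM-like update (simplified):
    1. take a gradient ascent step of size rho,
    2. apply the compression projection C to the perturbed iterate,
    3. evaluate the gradient at the compressed point,
    4. apply that gradient to the ORIGINAL dense weights."""
    perturbed = w + rho * grad_fn(w)
    compressed = topk_magnitude_prune(perturbed, sparsity)
    return w - lr * grad_fn(compressed)

# toy loss L(w) = 0.5 * ||w - t||^2 whose minimizer t is exactly 50% sparse,
# so a compression-robust solution exists
t = np.array([1.0, -2.0, 0.0, 0.0])
w = np.zeros(4)
for _ in range(300):
    w = cram_step(w, lambda v: v - t)
```

Because the descent gradient is taken at the compressed perturbation but applied to the dense iterate, the model remains dense during training while being steered toward weights whose pruned version also attains low loss.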
For instance, using CrAM to fine-tune the standard BERT-base model (Devlin et al., 2019) on SQuADv1.1 question answering (Rajpurkar et al., 2016), we obtain models that are both more accurate and more compressible than with optimizers such as Adam (Kingma & Ba, 2015) or SAM (Foret et al., 2021). In addition, a short (≤ 2 epochs) finetuning of the sparse model can provide substantial further improvements: the 80%-sparse CrAM-finetuned model reaches higher accuracy than the highly competitive gradual pruning methods PLATON (Zhang et al., 2022) and Movement Pruning (Sanh et al., 2020), at a fraction of the training budget.

CrAM lends itself to several extensions: it can be used with different layer-wise sparsity distributions, with semi-structured N:M sparsity patterns, and with various one-shot pruning techniques. Sparse CrAM models can be successfully used for sparse transfer learning, where they perform well on a wide range of "downstream" target tasks, even when compared to pruning methods that train a separate model for each sparsity level. We also provide evidence that the CrAM update can produce models that are robust to quantization.

As with SAM (Foret et al., 2021), one limitation is the added computational cost, since CrAM requires an additional backward pass for the model perturbation. This can be addressed either by performing only limited finetuning via CrAM instead of full retraining, or by performing a regular optimization step for a fraction of the iterations; we show that both options have limited impact on accuracy. Moreover, our approach is compatible with efficient SAM-type updates (Liu et al., 2020; Du et al., 2022a). We also provide a well-performing variant of CrAM that uses sparse gradients, which could be leveraged by frameworks with support for sparse back-propagation (Nikdan et al., 2023).
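To make the semi-structured pattern concrete, the projection onto 2:4 sparsity (keep the two largest-magnitude weights in every group of four consecutive weights) can be sketched as below. This is a generic illustration of the pattern itself, not CrAM's implementation of it.

```python
import numpy as np

def project_2_to_4(w):
    """Project a flat weight vector onto the 2:4 sparsity pattern:
    in every group of four consecutive weights, keep only the two
    entries of largest magnitude and zero out the other two."""
    assert w.size % 4 == 0, "weight count must be a multiple of 4"
    groups = w.reshape(-1, 4).copy()
    # indices of the two smallest-magnitude entries per group
    drop = np.argsort(np.abs(groups), axis=1)[:, :2]
    np.put_along_axis(groups, drop, 0.0, axis=1)
    return groups.reshape(w.shape)
```

Every group then has exactly 50% sparsity, which is the structure that sparse tensor cores on recent NVIDIA GPUs can accelerate.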

2. RELATED WORK

Sharpness-Aware Minimization (SAM). The recently introduced SAM optimizer (Foret et al., 2021) aims to minimize loss sharpness; this should in turn lead to flatter local minima with better generalization. The authors show that SAM-trained models have higher validation accuracy than vanilla SGD-type baselines, that their performance continues to improve with prolonged training, and that they can also be successfully used for transfer learning. One important drawback of SAM is its computational overhead, as it requires twice as many forward-backward passes through the network. Subsequent work has focused on reducing this cost by, for example, reducing the frequency of the extra gradient steps (Liu et al., 2022), computing the perturbations on a subset of the parameters (Du et al., 2022a), or proposing a new trajectory loss to replace the sharpness definition (Du et al., 2022b). We draw inspiration from properties of the initial SAM method proposed by Foret et al. (2021): instead of attempting to minimize the maximum local increase in loss (sharpness), our goal is to minimize the maximum local increase in loss due to compression.

Training prunable networks. The increasing scale of deep neural networks has made their deployment to edge devices dependent on compression techniques such as quantization and/or pruning. While post-training quantization can be efficient and successful without any retraining (Frantar & Alistarh, 2022), in the case of pruning the gold standard is still training a separate model for every target sparsity level (Zhu & Gupta, 2017; Singh & Alistarh, 2020; Evci et al., 2020; Peste et al., 2021), which can be expensive. A potential solution would be training a single dense model which either contains multiple smaller ones that can be easily deployed, or which is itself prunable at multiple sparsity levels without additional retraining.
For example, "once-for-all" (OFA) (Cai et al., 2019) can train a large network that contains multiple specialized sub-nets, adapted to devices with different resource constraints. However, obtaining the large OFA network is extremely expensive, and requires

