OVIT: AN ACCURATE SECOND-ORDER PRUNING FRAMEWORK FOR VISION TRANSFORMERS

Abstract

Models from the Vision Transformer (ViT) family have recently provided breakthrough results across image classification tasks such as ImageNet. Yet, they still face barriers to deployment, notably the fact that their accuracy can be severely impacted by compression techniques such as pruning. In this paper, we take a step towards addressing this issue by introducing Optimal ViT Surgeon (oViT), a new state-of-the-art weight sparsification method which is particularly well-suited to Vision Transformer (ViT) models. At the technical level, oViT introduces a new weight pruning algorithm which leverages second-order information, and in particular can handle weight correlations accurately and efficiently. We complement this accurate one-shot pruner with an in-depth investigation of gradual pruning, augmentation, and recovery schedules for ViTs, which we show to be critical for successful compression. We validate our method via extensive experiments on classical ViT and DeiT models, on hybrid architectures such as XCiT, EfficientFormer and Swin, as well as on general models such as highly-accurate ResNet and EfficientNet variants. Our results show for the first time that ViT-family models can in fact be pruned to high sparsity levels (e.g. ≥ 75%) with low impact on accuracy (≤ 1% relative drop). In addition, we show that our method is compatible with structured pruning and quantization, and that it can lead to significant speedups on a sparsity-aware inference engine.

1. INTRODUCTION

Attention-based Transformers (Vaswani et al., 2017) have revolutionized natural language processing (NLP), and have recently also become popular in computer vision (Dosovitskiy et al., 2020; Touvron et al., 2021; Carion et al., 2020). The Vision Transformer (ViT) (Dosovitskiy et al., 2020; Touvron et al., 2021) and its extensions (Ali et al., 2021; Liu et al., 2021; Wang et al., 2021), which are the focus of our study, have been remarkably successful, despite encoding fewer inductive biases. However, the high accuracy of ViTs comes at the cost of large computational and parameter budgets. In particular, ViT models are well-known to be more parameter-heavy than their convolutional counterparts (Dosovitskiy et al., 2020; Touvron et al., 2021). Consequently, a rapidly-expanding line of work has focused on reducing these costs for ViT models via model compression, thus enabling their deployment in resource-constrained settings. Several recent references adapted compression approaches to ViT models, investigating either structured pruning, which removes patches or tokens, or unstructured pruning, which removes individual weights. The consensus in the literature is that ViT models are generally less compressible than convolutional networks (CNNs) of the same accuracy. While the classic ResNet50 model (He et al., 2016) can be compressed to 80-90% sparsity with negligible loss of accuracy, e.g. (Frantar et al., 2021; Peste et al., 2021), the best currently-known results for similarly-accurate ViT models reach at most 50% sparsity while maintaining dense accuracy (Chen et al., 2021). It is therefore natural to ask whether this "lack of compressibility" is an inherent limitation of ViTs, or whether better results can be obtained via improved compression methods designed for these architectures.

Contributions. In this paper, we propose a new pruning method called Optimal ViT Surgeon (oViT), which improves the state-of-the-art accuracy-vs-sparsity trade-off for ViT-family models, and shows that they can be pruned to similar levels as CNNs. Our work is based on an in-depth investigation of ViT performance under pruning, and provides contributions across three main directions:

• A new second-order sparse projection. To address the fact that ViTs tend to lose significant accuracy upon each pruning step, we introduce a novel approximate second-order pruner called oViT, inspired by the classical second-order OBS framework (Hassibi et al., 1993). The key new feature of our pruner is that, for the first time, it can handle weight correlations during pruning both accurately and efficiently, via a new theoretical result which reduces weight selection to a sparse regression problem. This approach leads to state-of-the-art one-shot pruning results for both ViTs and conventional models (e.g. ResNets).

• Post-pruning recovery framework. To address the issue that ViTs are notoriously hard to train and fine-tune (Touvron et al., 2021; Steiner et al., 2021), we provide a set of efficient sparse fine-tuning recipes, enabling accuracy recovery at reasonable computational budgets.

• End-to-end framework for sparsity sweeps. Our accurate oViT pruner enables us to avoid the standard and computationally-heavy procedure of gradual pruning for every sparsity target independently, e.g. (Gale et al., 2019; Singh & Alistarh, 2020). Instead, we propose a simple pruning framework that produces accurate sparse models for a sequence of sparsity targets in a single run, accommodating various deployments under a fixed compute budget.

Our experiments focus on the standard ImageNet-1K benchmark (Russakovsky et al., 2015).
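To make the second-order contribution concrete, the classical OBS saliency and compensation update of Hassibi et al. (1993), which oViT builds on, can be sketched in a few lines. This is an illustration of the standard framework only, not the oViT correlation-aware solver itself; `obs_prune_one_weight` is a hypothetical helper name.

```python
import numpy as np

def obs_prune_one_weight(w, H_inv):
    """One step of classical OBS (Hassibi et al., 1993): pick the weight
    whose removal least increases the loss, then update the remaining
    weights to compensate. H_inv is the inverse Hessian of the loss."""
    # Saliency of removing weight i: w_i^2 / (2 * [H^-1]_ii)
    saliency = w ** 2 / (2.0 * np.diag(H_inv))
    i = int(np.argmin(saliency))
    # Closed-form compensation of the remaining weights
    delta = -w[i] / H_inv[i, i] * H_inv[:, i]
    w_new = w + delta
    w_new[i] = 0.0  # enforce an exact zero on the pruned coordinate
    return w_new, i
```

With a diagonal Hessian the update reduces to pure magnitude-based pruning; the off-diagonal terms of `H_inv` are exactly the weight correlations that oViT's sparse-regression formulation is designed to handle efficiently at scale.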
We show that, under low fine-tuning budgets, the oViT approach matches or improves upon the state-of-the-art SViTE (Chen et al., 2021) unstructured method at low-to-medium (40-50%) sparsities, and significantly outperforms it at the higher sparsities (≥ 60%) required to obtain inference speedups. Specifically, our results show that, at low targets (e.g. 40-50%), sparsity acts as a regularizer, sometimes improving validation accuracy relative to the dense baseline by margins between 0.5% and 1.8% Top-1 accuracy. At the same time, we show for the first time that ViT models can attain high sparsity levels without significant accuracy impact: specifically, we can achieve 75-80% sparsity with relatively minor (<1%) accuracy loss. Figure 1 summarizes our results. Conceptually, we show that ViT models do not require over-parametrization to achieve high accuracy, and that, post-pruning, they can be competitive with residual networks in terms of accuracy-per-parameter. Practically, we show that the resulting sparse ViTs can be executed with speedups on a sparsity-aware inference engine (Kurtz et al., 2020).

Figure 1: Validation accuracy versus nonzero parameters for DeiT-Tiny, -Small and -Base models, as well as a highly-accurate ResNet50D model, pruned to {50%, 60%, 75%, 80%, 90%} sparsities using iterative Global Magnitude (GM), SViTE, and oViT methods.

Overall, our correlation-aware pruner outperforms existing approaches on ViTs, CNNs (ResNet and EfficientNet), and language models (RoBERTa). Specifically for ViTs, our recovery recipes and sparsity sweep framework enable oViT to achieve state-of-the-art compression, roughly doubling sparsity at similar accuracy relative to SViTE (Chen et al., 2021), with only half the compression budget.

Related work. Vision Transformers (ViTs) (Dosovitskiy et al., 2020) have set new accuracy benchmarks, but are known to require careful tuning in terms of both augmentation and training hyperparameters; identifying efficient recipes is an active research topic in itself (Touvron et al., 2021; Steiner et al., 2021). We propose new and general recipes for fine-tuning ViTs, which should be useful to the community. Several prior works have investigated ViT compression, but focused on structured pruning, such as removing tokens (Zhu et al., 2021; Kim et al., 2021; Xu et al., 2021; Pan et al., 2021; Song et al., 2022; Rao et al., 2021; Hou & Kung, 2022). We show experimentally that these approaches are orthogonal to unstructured pruning, which can be applied in conjunction to further compress these models. Unstructured pruning, on which we focus here, considers the problem of removing individual network weights, and can be leveraged for computational savings (Kurtz et al., 2020; Hoefler et al., 2021). The only existing prior work on unstructured ViT pruning is SViTE (Chen et al., 2021), which applied the general RigL pruning method (Evci et al., 2020) to the special case of ViT models. We also present results relative to well-tuned magnitude pruning and the best existing second-order pruners (Singh & Alistarh, 2020).
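For reference, the Global Magnitude (GM) baseline mentioned above can be sketched as follows. This is a minimal illustration assuming dense NumPy weight arrays; `global_magnitude_mask` is a hypothetical helper, not code from any of the cited implementations.

```python
import numpy as np

def global_magnitude_mask(weights, sparsity):
    """Global Magnitude (GM) baseline: zero out the smallest-magnitude
    fraction `sparsity` of weights, ranked jointly across all layers.
    Returns one boolean keep-mask per layer."""
    flat = np.concatenate([w.ravel() for w in weights])
    k = int(sparsity * flat.size)  # number of weights to prune
    if k == 0:
        return [np.ones_like(w, dtype=bool) for w in weights]
    # k-th smallest absolute value becomes the pruning threshold;
    # ties at the threshold are all pruned.
    threshold = np.partition(np.abs(flat), k - 1)[k - 1]
    return [np.abs(w) > threshold for w in weights]
```

A sparsity sweep in the spirit of the one described above would call this (or any one-shot pruner) for an increasing sequence of targets, e.g. `[0.5, 0.6, 0.75, 0.8, 0.9]`, interleaved with fine-tuning, rather than running a full gradual-pruning schedule per target.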

