SPARSITY MAY CRY: LET US FAIL (CURRENT) SPARSE NEURAL NETWORKS TOGETHER!

Abstract

Sparse Neural Networks (SNNs) have received voluminous attention, predominantly due to the growing computational and memory footprints of consistently exploding parameter counts in large-scale models. Similar to their dense counterparts, recent SNNs generalize just as well and are equipped with numerous favorable benefits (e.g., low complexity, high scalability, and robustness), sometimes even better than the original dense networks. As research effort is focused on developing increasingly sophisticated sparse algorithms, it is startling that a comprehensive benchmark to evaluate the effectiveness of these algorithms has been highly overlooked. In the absence of a carefully crafted evaluation benchmark, most if not all sparse algorithms are evaluated against fairly simple and naive tasks (e.g., CIFAR-10/100, ImageNet, GLUE, etc.), which can potentially camouflage many advantages as well as unexpected predicaments of SNNs. In pursuit of a more general evaluation and unveiling the true potential of sparse algorithms, we introduce the "Sparsity May Cry" Benchmark (SMC-Bench), a collection of 4 carefully curated and diverse tasks with 10 datasets, which captures a wide range of domain-specific and sophisticated knowledge. Our systematic evaluation of the most representative sparse algorithms reveals an important obscured observation: the state-of-the-art magnitude- and/or gradient-based sparse algorithms seemingly fail to perform on SMC-Bench when applied out-of-the-box, sometimes at trivially low sparsity, e.g., 5%. These observations demand the immediate attention of the sparsity research community to reconsider the highly proclaimed benefits of SNNs. We further conduct a thorough investigation into the reasons for the failure of common SNNs. Our analysis points out that such failure is intimately related to the "lazy regime" of large model training, which points us toward stronger pruning recipes that partially alleviate the failure on SMC-Bench (though they still suffer to some degree).
By incorporating these well-thought-out and diverse tasks, SMC-Bench is designed to favor and encourage the development of more scalable and generalizable sparse algorithms. We open-source SMC-Bench to assist researchers in building next-generation sparse algorithms that scale and generalize: https://github.com/VITA-Group/SMC-Bench.

1. INTRODUCTION

Sparse Neural Networks (SNNs) are no stranger to the deep learning community (Liu & Wang, 2023), but they have recently received stupendous attention in the era of transformers (e.g., BERT, GPT, ViT, CLIP), where the parameter count is frequently measured in billions rather than millions. Thanks to the consistent efforts of sparsity researchers, SNNs have ushered in enormous breakthroughs: they can generalize just as well as the original dense networks, and it is feasible to procure them by pruning their dense counterparts after training (Frankle & Carbin, 2019; Sanh et al., 2020; Chen et al., 2020; Frankle et al., 2020), during training (Zhu & Gupta, 2017; Gale et al., 2019; Liu et al., 2021b), and even before training (Mocanu et al., 2018; Lee et al., 2019; Liu et al., 2022). Apart from the well-known efficiency benefits, SNNs surprisingly also enjoy auxiliary benefits such as adversarial robustness (Guo et al., 2018; Özdenizci & Legenstein, 2021; Chen et al., 2022), out-of-distribution generalization (Zhang et al., 2021; Diffenderfer et al., 2021), and uncertainty estimation (Liu et al., 2021a). Despite the multi-dimensional success of numerous sparse algorithms, our extensive survey of over 100 recent SNN papers from 2015-2022 startlingly unveils multiple daunting issues with the evaluation datasets and protocols blindly followed within the sparsity community, which may significantly impede future progress if left unacknowledged.

Issues with the current evaluation paradigm: Firstly, the vast majority of current work on SNNs is narrowly evaluated, i.e., it targets only a single task or a few tasks (usually image classification and sometimes language understanding) on which SNNs have already proven their proficiency (Gale et al., 2019; Frankle & Carbin, 2019). Surprisingly, 79 out of our carefully selected 100 papers on SNNs evaluate sparse models on merely a single task, and 72 of those evaluate image classification.
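To make the pruning terminology above concrete, the following is a minimal sketch of global one-shot magnitude pruning, the baseline that most of the surveyed approaches build on; the function and variable names are ours for illustration and do not come from any particular library:

```python
import numpy as np

def magnitude_prune(weights, sparsity):
    """Zero out the smallest-magnitude entries globally across all layers.

    `weights` is a list of numpy arrays (one per layer); `sparsity` is the
    fraction of parameters to remove (e.g. 0.9 keeps 10% of the weights).
    Returns one boolean keep-mask per layer.
    """
    all_mags = np.concatenate([np.abs(w).ravel() for w in weights])
    k = int(sparsity * all_mags.size)  # number of weights to remove
    if k == 0:
        return [np.ones_like(w, dtype=bool) for w in weights]
    # k-th smallest magnitude is the global pruning threshold
    threshold = np.partition(all_mags, k - 1)[k - 1]
    return [np.abs(w) > threshold for w in weights]

rng = np.random.default_rng(0)
layers = [rng.standard_normal((64, 32)), rng.standard_normal((32, 10))]
masks = magnitude_prune(layers, sparsity=0.9)
kept = sum(m.sum() for m in masks)
total = sum(m.size for m in masks)
print(f"density: {kept / total:.2f}")  # roughly 0.10
```

During- and before-training variants referenced above differ mainly in *when* such a mask is computed and whether it is allowed to change as training proceeds.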
Secondly, researchers are obsessed with evaluating SNNs on well-understood datasets, including but not limited to MNIST (LeCun, 1998) (26 papers), CIFAR-10/100 (Krizhevsky et al., 2009) (59 and 37 papers, respectively), ImageNet (Deng et al., 2009) (62 papers), and GLUE (Wang et al., 2018) (9 papers), where deep neural networks have already exceeded human-equivalent performance (refer to Appendix D for more details). For instance, even though ImageNet has been considered a rather challenging task over the years, very high accuracy (>90%) has been reported many times (Yu et al., 2022; Wortsman et al., 2022; Zhai et al., 2022). Such relatively restricted evaluations with "nearly saturated performance" limit the scope of sparse neural networks and are potentially ill-suited to identify new and unexpected capabilities of SNNs. Addressing the aforementioned limitations of current SNN evaluation protocols is a pressing need for the community. To this end, we assemble a large-scale, fairly arduous, and diverse benchmark for sparse neural networks: the "Sparsity May Cry" Benchmark (or briefly SMC-Bench). Specifically, we consider a broad set of tasks including complex reasoning, multilingual translation, and protein prediction, whose content spans multiple domains. These tasks require a vast amount of commonsense knowledge and solid mathematical and scientific backgrounds to solve, even for humans. Note that none of the datasets in SMC-Bench was created from scratch for the benchmark; we rely on pre-existing datasets that researchers have agreed are challenging, interesting, and of high practical value. We rigorously measure the performance of state-of-the-art (SOTA) pruning and sparse training approaches (in their most common, basic settings) on SMC-Bench, to understand the potential of SNNs to scale and generalize. Our key observations and contributions can be summarized as follows:

• We present the "Sparsity May Cry" Benchmark to re-define the evaluation protocols for sparse neural networks and facilitate a comprehensive assessment of SOTA sparse algorithms. The premise of SMC-Bench is to develop a suite of large-scale, challenging, realistic, and diverse tasks and datasets that can empower rigorous advancements in the community.

• SMC-Bench unveils a critical and startling observation: all of the SOTA sparse algorithms seem to fail on SMC-Bench "out-of-the-box", sometimes at trivially low sparsity, e.g., 5%. Note that the failure is not specific to one sparsification approach but occurs unanimously across all approaches we evaluated. This observation alarmingly demands the attention of the sparsity community to reconsider the highly proclaimed benefits of SNNs.

• We conduct extensive experiments across representative SNNs produced by various SOTA pruning and sparse training approaches on SMC-Bench, and we summarize our findings: ❶ Model prunability is intimately related to task difficulty: models trained on difficult tasks suffer more from pruning than those trained on easier tasks; ❷ The success of before-training sparsification (sparse training or pruning at initialization) is hard to generalize to more complex scenarios; ❸ Iterative magnitude pruning (IMP) does not necessarily generalize better than one-shot pruning (OMP) or during-training pruning; ❹ Despite performance differences, different magnitude-based pruning approaches lead to extremely similar layerwise sparsities.

• We further carry out a comprehensive investigation into the potential causes of SNN failures on SMC-Bench. Our analysis suggests that the failure of the existing sparse algorithms might be due to the "lazy regime" dynamics emerging in sufficiently overparameterized models (Chizat et al., 2019; Malladi et al., 2022). Inspired by this finding, we hypothesize and confirm that second-order pruning approaches, e.g., oBERT (Kurtic et al., 2022), are more reliable pruning approaches for SMC-Bench, yielding relatively more promising performance in Appendix C.

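Regarding finding ❸, the difference between OMP and IMP rests on the pruning schedule: OMP removes all targeted weights in one step, whereas IMP prunes a fixed fraction of the *surviving* weights each round, so the overall sparsity compounds across rounds. A small sketch of that compounding schedule (the function name is ours, not from the paper):

```python
def imp_round_sparsities(prune_frac_per_round, n_rounds):
    """Overall sparsity after each IMP round, assuming the common schedule
    that prunes a fixed fraction of the weights surviving so far."""
    return [1.0 - (1.0 - prune_frac_per_round) ** r
            for r in range(1, n_rounds + 1)]

# Pruning 20% of the surviving weights per round:
print([round(s, 3) for s in imp_round_sparsities(0.2, 5)])
# [0.2, 0.36, 0.488, 0.59, 0.672]
```

Reaching a high target sparsity (say 90%) therefore takes roughly ten such rounds, each typically followed by retraining, which is the main cost IMP pays over OMP.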
