WHAT MATTERS IN THE STRUCTURED PRUNING OF GENERATIVE LANGUAGE MODELS?

Abstract

Auto-regressive large language models such as GPT-3 require enormous computational resources to run, incurring substantial financial cost and environmental impact. Structured pruning is a traditional approach to reducing resource usage; however, its application to and efficacy for generative language models are heavily under-explored. We analyze the effects of magnitude, random, and movement (Lagunas et al., 2021) pruning on MLP layers in GPT-like models. We find that movement pruning can underperform for these models, while random pruning nearly matches the best methods. By examining neuron-level redundancy measures, we discover that movement pruning does not select neurons based on how unique they are relative to other neurons, leaving behind excess redundancy. In view of this, we introduce Globally Unique Movement (GUM), which selects neurons based on both uniqueness and sensitivity. We then discuss the effects of our techniques on different redundancy metrics through careful comparisons and ablations.

1. INTRODUCTION

Large language models (LLMs), such as the state-of-the-art GPT-3 model (Brown et al., 2020) with up to 175 billion parameters, have achieved remarkable performance on natural language processing (NLP) tasks. However, training and deploying such massive models pose significant challenges in terms of computational cost, energy consumption, and environmental impact. It is therefore crucial to develop effective methods that reduce the size of LLMs without compromising their quality.

Neural network pruning is a long-standing model compression method (Janowsky, 1989; Mozer & Smolensky, 1988; Frankle & Carbin, 2018; Karnin, 1990; Blalock et al., 2020). It can be broadly classified into two types: unstructured and structured. Unstructured pruning removes individual weights from the network based on some criterion, such as magnitude or movement, resulting in sparse weight matrices that can be stored and processed more efficiently. Structured pruning, on the other hand, eliminates whole components, such as neurons, channels, or blocks, leading to smaller architectures that reduce end-to-end inference latency. While unstructured pruning has been extensively studied and applied to LLMs (Wang et al., 2020b; Xu et al., 2021; Zafrir et al., 2021; Li et al., 2022), structured pruning is more challenging and less explored. Structured pruning is nevertheless more desirable in many practical scenarios, such as deploying these models on resource-constrained devices or providing fast and reliable services built on LLMs. Existing work on structured pruning for LLMs focuses on BERT-like networks (Devlin et al., 2018) that consist of an encoder-decoder or encoder-only architecture (Li et al., 2020; Xia et al., 2022; Zhang et al., 2022; Yao et al., 2021). These models are mainly used for natural language understanding (NLU) tasks such as question answering, sentiment analysis, and natural language inference.
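As an illustration of the distinction above (a toy sketch in NumPy, not any method from this work), the two pruning types can be contrasted using magnitude as the selection criterion; the function names and the small weight matrix here are our own illustrative assumptions:

```python
import numpy as np

def unstructured_magnitude_prune(W, sparsity):
    """Unstructured: zero out the smallest-magnitude individual weights.
    The matrix keeps its shape; it merely becomes sparse."""
    k = int(W.size * sparsity)
    if k == 0:
        return W.copy()
    thresh = np.sort(np.abs(W), axis=None)[k - 1]
    return np.where(np.abs(W) <= thresh, 0.0, W)

def structured_magnitude_prune(W, sparsity):
    """Structured: remove whole neurons (rows) with the smallest L2 norm.
    The result is a physically smaller matrix, i.e. a smaller architecture."""
    n_keep = W.shape[0] - int(W.shape[0] * sparsity)
    norms = np.linalg.norm(W, axis=1)          # one score per neuron
    keep = np.sort(np.argsort(norms)[-n_keep:])  # indices of highest-norm neurons
    return W[keep]

rng = np.random.default_rng(0)
W = rng.normal(size=(8, 4))                    # toy MLP layer: 8 neurons, 4 inputs
W_u = unstructured_magnitude_prune(W, 0.5)     # same (8, 4) shape, half the entries zero
W_s = structured_magnitude_prune(W, 0.5)       # only 4 neurons remain: shape (4, 4)
```

The unstructured variant needs sparse-kernel support to realize any speedup, whereas the structured variant shrinks the matrix itself, which is why structured pruning translates directly into lower end-to-end inference latency.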
Among the various methods, Block Movement Pruning (Lagunas et al., 2021) is a recent and popular technique that removes weight blocks based on movement. However, there is a lack of systematic research on structured pruning for decoder-only architectures such as GPT-2 (Radford et al., 2019), GPT-3 (Brown et al., 2020), or GPT-Neo (Black et al., 2021), which are mainly used for natural language generation (NLG) tasks such as text summarization, machine translation, and text completion. While some works apply unstructured pruning (Li et al., 2022) or various orthogonal compression techniques to decoder-only LLMs (Wang et al., 2020a; Li et al., 2021; Edalati et al., 2022; Tao et al., 2022; Xu & Hu, 2022; Chen et al., 2021), there is no comprehensive evaluation of traditional structured pruning for these models on NLG tasks.

