WHAT MATTERS IN THE STRUCTURED PRUNING OF GENERATIVE LANGUAGE MODELS?

Abstract

Auto-regressive large language models such as GPT-3 require enormous computational resources to use, leading to huge financial cost and environmental impact. Structured pruning methods traditionally reduce resource usage; however, their application to, and efficacy for, generative language models remains heavily under-explored. We analyze the effects of magnitude, random, and movement pruning (Lagunas et al., 2021) on the MLP layers of GPT-like models. We find that movement pruning can underperform on these models, while random pruning nearly matches the best methods. By examining neuron-level redundancy measures, we discover that movement pruning does not select neurons based on how unique they are compared to other neurons, leaving behind excess redundancy. In view of this, we introduce Globally Unique Movement (GUM), which selects neurons based on both uniqueness and sensitivity. We then discuss how our techniques affect different redundancy metrics through careful comparisons and ablations.

1. INTRODUCTION

Large language models (LLMs), such as the state-of-the-art GPT-3 model (Brown et al., 2020) with up to 175 billion parameters, have achieved remarkable performance in natural language processing (NLP) tasks. However, training and deploying such massive models also pose significant challenges in terms of computational cost, energy consumption, and environmental impact. Therefore, it is crucial to develop effective methods to reduce the size of LLMs without compromising their quality.

Neural network pruning is a long-standing model compression method (Janowsky, 1989; Mozer & Smolensky, 1988; Frankle & Carbin, 2018; Karnin, 1990; Blalock et al., 2020). It can be broadly classified into two types: unstructured and structured. Unstructured pruning removes individual weights from the network based on some criterion, such as magnitude or movement, resulting in sparse weight matrices that can be stored and processed more efficiently. Structured pruning, on the other hand, eliminates whole components, such as neurons, channels, or blocks, leading to smaller architectures that reduce end-to-end inference latency. While unstructured pruning has been extensively studied and applied to LLMs (Wang et al., 2020b; Xu et al., 2021; Zafrir et al., 2021; Li et al., 2022), structured pruning is more challenging and less explored. However, structured pruning is also more desirable in many practical scenarios, such as deploying these models on resource-constrained devices or providing fast and reliable services based on LLMs. Existing work on structured pruning for LLMs focuses on BERT-like networks (Devlin et al., 2018) that consist of an encoder-decoder or an encoder-only architecture (Li et al., 2020; Xia et al., 2022; Zhang et al., 2022; Yao et al., 2021). These models are mainly used for natural language understanding (NLU) tasks, such as question answering, sentiment analysis, or natural language inference.
Among the various methods, Block Movement Pruning (Lagunas et al., 2021) is a recent and popular technique that removes weight blocks based on movement. However, there is a lack of systematic research on structured pruning for decoder-only architectures such as GPT-2 (Radford et al., 2019), GPT-3 (Brown et al., 2020), or GPT-Neo (Black et al., 2021), which are mainly used for natural language generation (NLG) tasks, such as text summarization, machine translation, or text completion. While some works apply unstructured pruning (Li et al., 2022) or orthogonal compression techniques to decoder-only LLMs (Wang et al., 2020a; Li et al., 2021; Edalati et al., 2022; Tao et al., 2022; Xu & Hu, 2022; Chen et al., 2021), there is no comprehensive evaluation of traditional structured pruning for these models on NLG tasks.

In this work, we compress decoder-only auto-regressive language models. Due to the lack of prior literature with the same goal, we evaluate the performance of several general-domain pruning methods on NLG tasks, including magnitude and movement pruning. However, we find these methods can struggle or under-perform compared to naïve baselines, leading to the following question: What determines the performance of structured pruning on generative language models? We aim to fill this gap by conducting a systematic study of structured fine-pruning (pruning while finetuning) methods for decoder-only LLMs on NLG tasks¹, and further propose a novel method that combines the strengths of different existing methods. Our main contributions are:

• To our knowledge, we perform the first systematic evaluation of several structured pruning methods on decoder-only LLMs on NLG tasks. We find that they achieve only marginal improvements over randomly selecting neurons during finetuning. We explain their limitations under our proposed analysis framework, and characterize their advantages and disadvantages via the metrics we evaluate.
• We propose an empirical analysis framework for structured pruning that relies on two fundamental measures of redundancy: sensitivity and uniqueness. Sensitivity reflects how much the removal of a network component affects the output of the model, while uniqueness reflects how much the information provided by a network component differs from that of others. Our framework allows us to understand and compare the behavior and performance of different pruning methods.

• To show the impact made possible by our analysis, we propose a proof-of-concept method, Globally Unique Movement (GUM), that aims to maximize both sensitivity and uniqueness by pruning network components based on their global movement and local uniqueness scores. GUM outperforms existing methods on several NLG tasks and achieves competitive compression rates, suggesting that future methods should preserve both sensitivity and uniqueness. We also conduct ablation studies to validate the effectiveness of our method components and design choices.

2. BACKGROUND & METHODOLOGY

There are many general-domain pruning methods. We focus on fine-pruning, a technique relevant to language models that performs automated gradual pruning (Zhu & Gupta, 2017) while fine-tuning. We focus on pruning the MLPs of generative models: at inference time, generative models can cache attention key and value states, so, especially in large models, MLPs account for more of the per-token latency than attention. MLPs also appear to store factual knowledge (Petroni et al., 2019; Meng et al., 2022), which may make pruning them particularly challenging.
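Automated gradual pruning ramps sparsity from an initial to a final value over training with a cubic schedule, s_t = s_f + (s_i - s_f)(1 - (t - t_0)/(nΔt))³. A minimal sketch of that schedule (function name and default values are illustrative, not from the paper):

```python
def sparsity_at_step(t, s_i=0.0, s_f=0.9, t0=0, n=100, dt=1):
    """Cubic sparsity schedule of Zhu & Gupta (2017):
    s_t = s_f + (s_i - s_f) * (1 - (t - t0) / (n * dt))**3,
    ramping from initial sparsity s_i to final sparsity s_f over
    n pruning steps spaced dt apart, starting at step t0."""
    if t < t0:
        return s_i                       # pruning has not started yet
    if t >= t0 + n * dt:
        return s_f                       # final sparsity reached
    frac = (t - t0) / (n * dt)
    return s_f + (s_i - s_f) * (1.0 - frac) ** 3
```

The cubic shape prunes aggressively early, when many weights are redundant, and slowly near the end, giving the remaining weights time to recover during fine-tuning.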

Notation and Background

We first define some notation for the MLP layers. Let σ(·) : R^m → R^m be an element-wise activation function (e.g., GeLU), let W_1 ∈ R^{m×d} and W_2 ∈ R^{d×m} be two weight matrices, and let b ∈ R^m be the bias vector. For an input token x ∈ R^d, the MLP layer output is

MLP(x) = x + W_2 h(x), with intermediate output h(x) = σ(W_1 LN(x) + b),

where LN denotes layer normalization. We use ⊙ to denote element-wise multiplication, and L to denote the task loss. We study methods of reducing m, the intermediate dimension, which is usually set to m = 4d.

Movement Pruning

Movement Pruning (Sanh et al., 2020) is a popular fine-pruning method. In this paper, we focus on the block version of movement pruning (Lagunas et al., 2021), but first introduce the original unstructured method. Let L(W) be the task loss with weight parameters W. For each weight parameter W_{i,j}, we compute an accumulated score S_{i,j} at iteration T by the following expression²:

S^{(T)}_{i,j} = -η_S · Σ_{t ≤ T} W^{(t)}_{i,j} · ∂L(W^{(t)}) / ∂W_{i,j}
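The accumulated score above is just a running sum of weight-times-gradient over training iterations. A minimal NumPy sketch (the function name, score learning rate η_S, and explicit history arguments are illustrative; in practice the score is updated online at each training step):

```python
import numpy as np

def movement_scores(weight_history, grad_history, eta_s=0.01):
    """Accumulate movement-pruning importance scores
    S^(T)_{i,j} = -eta_s * sum_{t<=T} W^(t)_{i,j} * dL/dW^(t)_{i,j}.
    A large positive score means the weight consistently moved
    away from zero during training."""
    S = np.zeros_like(weight_history[0], dtype=float)
    for W_t, G_t in zip(weight_history, grad_history):
        S -= eta_s * W_t * G_t
    return S
```

In the structured (block) variant, per-weight scores are pooled over a block or neuron, and the lowest-scoring blocks are masked once the gradual-pruning schedule calls for more sparsity.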



¹ All code is publicly available at (removed for peer-review).
² Gradients are calculated straight-through to the mask scores; otherwise the gradient would be undefined (Bengio et al., 2013).

