WHAT MATTERS IN THE STRUCTURED PRUNING OF GENERATIVE LANGUAGE MODELS?

Abstract

Auto-regressive large language models such as GPT-3 require enormous computational resources to use, leading to high financial cost and environmental impact. Structured pruning methods traditionally reduce resource usage; however, their application to and efficacy for generative language models remain heavily under-explored. We analyze the effects of magnitude, random, and movement (Lagunas et al., 2021) pruning on MLP layers in GPT-like models. We find that movement can underperform for these models while random pruning nearly matches the best methods. By examining neuron-level redundancy measures, we discover that movement does not select neurons based on how unique they are compared to other neurons, leaving behind excess redundancy. In view of this, we introduce Globally Unique Movement (GUM) to select neurons based on both uniqueness and sensitivity. We then discuss the roles of our techniques on different redundancy metrics through careful comparisons and ablations.

1. INTRODUCTION

Large language models (LLMs), such as the state-of-the-art GPT-3 model (Brown et al., 2020) with up to 175 billion parameters, have achieved remarkable performance in natural language processing (NLP) tasks. However, training and deploying such massive models poses significant challenges in terms of computational cost, energy consumption, and environmental impact. It is therefore crucial to develop effective methods to reduce the size of LLMs without compromising their quality.



In this work, we compress decoder-only auto-regressive language models. Due to the lack of prior literature with the same goal, we evaluate the performance of several general-domain pruning methods on NLG tasks, including magnitude and movement pruning. However, we find these methods can struggle or under-perform compared to naïve baselines, leading to the following question: what determines the performance of structured pruning on generative language models? We aim to fill this gap by conducting a systematic study of structured fine-pruning (pruning while finetuning) methods for decoder-only LLMs on NLG tasksfoot_0, and by proposing a novel method that combines the strengths of existing methods. Our main contributions are:

• To our knowledge, we perform the first systematic evaluation of several structured pruning methods on decoder-only LLMs for NLG tasks. We find that they achieve only marginal improvements over randomly selecting neurons during finetuning. We explain their limitations under our proposed analysis framework and characterize their advantages and disadvantages via the metrics we evaluate.

• We propose an empirical analysis framework for structured pruning that relies on two fundamental measures of redundancy: sensitivity and uniqueness. Sensitivity reflects how much the removal of a network component affects the output of the model, while uniqueness reflects how much the information provided by a network component differs from others. Our framework allows us to understand and compare the behavior and performance of different pruning methods.

• To show the impact made possible by our analysis, we propose a proof-of-concept method, Globally Unique Movement (GUM), that aims to maximize both sensitivity and uniqueness by pruning network components based on their global movement and local uniqueness scores.
GUM outperforms the existing methods on several NLG tasks and achieves competitive compression rates, proving that future methods should preserve both sensitivity and uniqueness. We also conduct ablation studies to validate the effectiveness of our method components and design choices.

2. BACKGROUND & METHODOLOGY

There are many general-domain pruning methods. We focus on fine-pruning, a technique relevant to language models that performs automated gradual pruning (Zhu & Gupta, 2017) while fine-tuning. We focus on pruning the MLPs of generative models: at inference time, generative models can cache attention states, so, especially in large models, MLPs account for more of the computation than attention when generating new tokens. MLPs also appear to store factual knowledge (Petroni et al., 2019; Meng et al., 2022), which may make their reduction challenging.
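Automated gradual pruning ramps the target sparsity up during training rather than removing everything at once. A minimal sketch of the cubic schedule from Zhu & Gupta (2017) follows; the function and argument names are ours, not from the paper:

```python
def sparsity_at_step(t, t0, n, s_init=0.0, s_final=0.75):
    """Cubic sparsity schedule (Zhu & Gupta, 2017): sparsity ramps
    from s_init to s_final over n steps, starting at step t0."""
    if t < t0:
        return s_init
    if t >= t0 + n:
        return s_final
    frac = 1.0 - (t - t0) / n          # fraction of the ramp remaining
    return s_final + (s_init - s_final) * frac ** 3
```

At each training step, the scheduler's current sparsity determines how many groups the pruner removes, so early training prunes quickly while later training converges gently toward the final target.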

Notation and Background

We first define some notation for the MLP layers. Let σ(·) : R^m → R^m be an element-wise activation function (e.g., GeLU), let W_1 ∈ R^{m×d} and W_2 ∈ R^{d×m} be two weight matrices, and let b ∈ R^m be the bias vector. For an input token x ∈ R^d, the MLP layer output is MLP(x) = x + W_2 h(x), with intermediate output h(x) = σ(W_1 LN(x)), where LN denotes layer normalization. We use ⊙ to denote element-wise multiplication and L to denote the task loss. We study methods of reducing m, the intermediate dimension, which is usually set to m = 4d.

Movement Pruning. Movement pruning (Sanh et al., 2020) is a popular fine-pruning method. In this paper we focus on the block version of movement pruning (Lagunas et al., 2021), but first introduce the original unstructured method. Let L(W) be the task loss with weight parameters W. For each weight parameter W_{i,j}, we compute an accumulated score S_{i,j} at iteration T by the following expressionfoot_1:

S^(T)_{i,j} = -η_S Σ_{t≤T} W^(t)_{i,j} · ∂L(W^(t)) / ∂W_{i,j}

Afterwards, the scores are used to compute a mask M with entries M_{i,j} ∈ {0, 1}, and we apply the mask via W′ = M ⊙ W, b′ = M ⊙ b, and U′ = U ⊙ M to remove the masked weights. There are two ways to compute the mask M: M = Top_v(S) for hard movement and M = 1_{sigmoid(S) > τ} for soft movement, where τ and v are both hyperparameters and Top_v(S) is defined as:

Top_v(S)_i = 1 if S_i is in the top v%, and 0 otherwise.

Additionally, mask scores are regularized via a regularization term with multiplier λ_mvp of the form R(S) = λ_mvp Σ_{i,j} sigmoid(S_{i,j}). Hard movement prunes all layers by the same amount. Soft movement, in contrast, allows adaptive sparsity across layers, which is known to be crucial in high-sparsity regimes (He et al., 2018b; Mallya et al., 2018), and was found to be the superior method for NLU tasks in Sanh et al. (2020).
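The score accumulation and the two masking rules above can be sketched in a few lines of NumPy; this is a simplified illustration of unstructured movement pruning, with function names of our choosing:

```python
import numpy as np

def update_movement_scores(S, W, grad_W, lr_mask):
    """One accumulation step of the movement score:
    S <- S - lr_mask * W * dL/dW, summed over training iterations."""
    return S - lr_mask * W * grad_W

def hard_mask(S, v):
    """Hard movement: keep the fraction v of entries with the largest scores."""
    k = int(round(v * S.size))
    thresh = np.sort(S, axis=None)[-k]
    return (S >= thresh).astype(S.dtype)

def soft_mask(S, tau):
    """Soft movement: keep entries whose sigmoid(score) exceeds tau."""
    return (1.0 / (1.0 + np.exp(-S)) > tau).astype(S.dtype)
```

Note that in hard movement the kept fraction v is fixed per layer, while in soft movement the threshold τ lets each layer keep however many entries clear it, which is what produces adaptive per-layer sparsity.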
Block pruning extends this method to operate on groups of weights by combining mask scores per block, allowing for structured pruning (Lagunas et al., 2021).

Magnitude Pruning. We use a masked version of block magnitude pruning (a block extension of group lasso, as in Shen et al. (2021)) as a baseline. For each group G of parameters, we assign the mask score

S_G = ( Σ_{(i,j)∈G} |W_{i,j}|^2 )^{1/2}

and gradually prune the groups with the smallest scores.

Gradual Random Pruning. Random pruning approaches have been explored previously (Yu et al., 2017; Blalock et al., 2020); in particular, gradual random pruning (Gale et al., 2019) has been found to perform relatively well. We further explore random pruning in conjunction with distillation. Our gradual random method freezes S at its random initialization for the duration of finetuning and prunes using Top_v(S).
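For MLP neurons, a natural grouping takes each neuron's row of W_1 together with its column of W_2; the block magnitude score is then the L2 norm over that group. A minimal sketch, where the grouping choice is our assumption for illustration:

```python
import numpy as np

def block_magnitude_scores(W1, W2):
    """Per-neuron block magnitude score: the L2 norm of all weights in a
    neuron's group (here, row i of W1 in R^{m x d} and column i of
    W2 in R^{d x m}), in the style of a group-lasso baseline."""
    return np.sqrt((W1 ** 2).sum(axis=1) + (W2 ** 2).sum(axis=0))
```

The pruner then removes the groups with the smallest scores at each step of the gradual schedule; gradual random pruning replaces these scores with a frozen random vector.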

Knowledge Distillation

In practice, pruning is often paired with knowledge distillation (Hinton et al., 2015) to boost performance. The distillation loss adds the KL divergence between a teacher model and the smaller student model. When used, we distill from a finetuned version of the model being pruned.
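Concretely, the added term is KL(teacher ∥ student) over the output token distributions. A minimal per-position sketch (the temperature parameter T is a standard KD ingredient, assumed here rather than taken from the paper):

```python
import numpy as np

def distillation_loss(student_logits, teacher_logits, T=1.0):
    """KL(teacher || student) over softened distributions, a minimal
    sketch of the KD term (Hinton et al., 2015). Logits: shape (vocab,)."""
    def log_softmax(z):
        z = z - z.max()                      # numerical stability
        return z - np.log(np.exp(z).sum())
    log_p_t = log_softmax(teacher_logits / T)
    log_p_s = log_softmax(student_logits / T)
    p_t = np.exp(log_p_t)
    return float((p_t * (log_p_t - log_p_s)).sum())
```

In training this term is averaged over positions and added to the task loss with a mixing weight; it is zero when the student exactly matches the teacher.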

3. FINE-PRUNING FOR GENERATIVE LANGUAGE MODELS

We present our framework for understanding the redundancy of pruning methods. In this work, we focus on improving the seminal movement pruning method proposed in Sanh et al. (2020). However, naïvely applying movement pruning often yields marginal or worse performance compared to random pruning. We dissect our results using a systematic framework and analyze the methods' behaviors and properties.

3.1. OBSERVATIONS OF PREVIOUS PRUNING METHODS

Soft Movement (BERT's Best Method) Struggles for GPT-like Models. It is shown in Lagunas et al. (2021) that soft movement outperforms hard movement when block-pruning encoder-decoder models. However, we find the method severely struggles when using the original implementationfoot_2 on GPT-like models due to highly sensitive hyperparameters. For instance, the mask regularization parameter λ_mvp can either be too large, pruning too aggressively, or too small, resulting in under-pruning as shown below. Even after grid-searching λ_mvp we still find subpar performance, and given the extremely high runtimes for this method as listed in Appendix A, we find it impractical to use.

Random Pruning Works Surprisingly Well. One might expect movement pruning to easily beat random pruning; however, we find their performance to differ only slightly or sometimes match, especially under distillation. Other works have noted random pruning's effectiveness (Gale et al., 2019), but we find the difference on generative tasks to be particularly slim. As shown in Tables 2 and 4, random pruning performs very close to both hard and soft movement pruning on the WikiSQL and Wikitext datasets. Moreover, when combined with distillation, the gaps between random pruning and the other methods largely close, which is itself an intriguing observation, as we discuss below.

Distillation Closes the Gaps Between Different Methods. As shown in Tables 2 and 3, methods with very different performance perform rather similarly when distilled from a non-pruned, finetuned model. Indeed, both the WikiSQL and Wikitext experiments in Tables 4 and 5 show that when the network has fewer leftover neurons (e.g., 10% or 25%), the difference in accuracy or perplexity often falls below half of the difference without distillation. This observation remains consistent across models of different sizes, architectures, and tasks.
Results for GPT-Neo with 1.3 billion parameters in Table 6 show that pruning a larger model can still benefit from distillation. Knowledge distillation often boosts the performance of weaker methods even more, which might suggest the differences between methods are largely due to an inability to learn a more diverse feature set during fine-pruning, as suggested by the work of Allen-Zhu & Li (2020).

3.2. TWO TYPES OF REDUNDANCY MEASURES: SENSITIVITY AND UNIQUENESS

In order to understand why these pruning methods display such behaviors, we devise a framework to characterize the leftover neurons of a pruned network based on two criteria: sensitivity and uniquenessfoot_3. Sensitivity captures how much a neuron contributes to the task objective L, while uniqueness captures how much information it provides that is not already captured by other neurons. We formalize these notions of redundancy as follows:

Definition 3.1 (Redundancy Criteria). Given a set of neurons {h_i(·)}_{i∈[m]} and input X, we call a neuron h_i redundant if it meets at least one of the following two conditions:

1. Sensitivity/Saliency: the neuron is not salient if its outputs are negligible or have small gradients when optimizing for the downstream task, mathematically described as E[ |h_i(X) · ∂L/∂h_i(X)| ] ≈ 0.

2. Uniqueness: the neuron is not unique if its outputs can be reconstructed entirely from a linear combination of the outputs of other neurons over all inputs X, mathematically described as h_i(X) ∈ span({h_j(X)}_{j≠i}).

Intuitively, a sensitive neuron has outputs that greatly contribute to the final output, while a unique neuron has outputs that differ from those of others. These metrics are independent of one another, so a neuron could be highly salient but replaceable by other neurons, or highly unique but ultimately contribute little to the network. Consider a toy example where two neurons h_i and h_j have the same non-zero weights and large gradients. Neuron h_i could easily be removed by doubling the outputs of h_j, so neither is unique, but both are highly salient.

The General Trade-off Between Saliency and Uniqueness. Equipped with Definition 3.1, we find the following important trends. Figures 1 and 2 show that, without distillation, different methods often focus on only one of the redundancy measures.
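The sensitivity criterion in Definition 3.1 can be estimated empirically from recorded activations and their loss gradients. A minimal sketch (the function name and array layout are our assumptions):

```python
import numpy as np

def neuron_sensitivity(h, grad_h):
    """Empirical per-neuron sensitivity E|h_i(X) * dL/dh_i(X)|.
    h, grad_h: arrays of shape (num_tokens, m) holding intermediate
    activations and the loss gradients with respect to them."""
    return np.abs(h * grad_h).mean(axis=0)
```

A neuron scores near zero either because its activations are negligible or because the loss is insensitive to them, matching the two clauses of the saliency condition.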
We now comment on trends across all experimental result tables (Tables 2, 3, 4, 5, 6, 7, and 8). The best-performing methods strike a balance between both measures, establishing a strong correlative link between these metrics and final pruning performance. Under distillation, however, sensitivity seemingly concentrates across methods. Regardless of method, as more pruning occurs, sensitivity generally decreases and uniqueness increases. For individual methods, we can dive deeper:

• Magnitude pruning universally scores worst on both metrics, explaining its poorer performance in all experiments. However, with distillation, the sensitivity gaps between methods noticeably decrease, which partially explains why distillation improves it significantly.

• Random pruning obtains distilled sensitivity and uniqueness similar to, though slightly lower than, hard movement, lending credence to its overall high performance. However, its sensitivity is markedly lower without distillation, as reflected in all figures. Since random pruning does not target uniqueness, the similar uniqueness scores suggest that hard movement does not target uniqueness either.

• Soft movement pruning also usually scores poorly on both metrics, and sometimes abysmally, as with its sensitivity in Figure 1, helping explain its overall poor performance.

• Hard movement pruning consistently obtains the highest sensitivity with not-far-behind uniqueness across different datasets and architectures. This correlates with its high performance when distillation is not used. However, when combined with distillation, the sensitivity gaps between methods converge and the advantage of hard movement fades.

• GUM, our proof-of-concept method, nearly always obtains the best uniqueness while maintaining decent sensitivity, further improved by distillation, explaining its superiority across various tasks.
However, GUM shows a larger performance increase for GPT-Neo-125m than for GPT-2-sm on WikiSQL; Figure 2 explains this, as pruned GPT-2 already has high baseline uniqueness on WikiSQL (∼2x), so further increases incur diminishing returns. Given the training/validation split and general noise in the datasets, there are some outlier points, for instance GUM's surprisingly poor distilled uniqueness on Wikitext in Figure 1. We observe higher absolute uniqueness on Wikitext in general (around 95% of neurons are unique per cosine similarity), meaning uniqueness varies across datasets and improving it can be difficult.

4. GLOBALLY UNIQUE MOVEMENT

After observing the lack of uniqueness amongst leftover neurons, we set out to improve the performance of hard movement. We introduce two techniques which together comprise Globally Unique Movement (GUM). In essence, we encourage a score-weighted uniqueness term by multiplying the score regularizer and the cosine similarity together, obtaining a balance of uniqueness and sensitivity.

Tackling Non-Unique Redundancy. Regularizing or pruning via similarity is a well-explored topic (Ayinde et al., 2019; Zhu et al., 2018; Srinivas & Babu, 2015; Santacroce et al., 2020), and existing techniques would increase uniqueness. However, our approach integrates more cleanly with movement pruning, insulating the weights themselves from regularization, with only a small increase in training time as listed in Appendix A. We regularize mask scores based on cosine similarityfoot_4. The cosine similarity between the outputs of any two neurons given input X (for example, a batch of tokens) is defined simply as

sim(h_i(X), h_j(X)) = h_i(X)^⊤ h_j(X) / ( ‖h_i(X)‖_2 ‖h_j(X)‖_2 )    (3)

However, calculating similarity with only intra-batch estimates is noisy and unreliable, so we introduce a running version of the estimates in Algorithm 1 to obtain the cosine similarity sim^(t)(h_i, h_j) between neurons h_i(·) and h_j(·).

Algorithm 1: Running Cosine Similarity Update
Require: a set of neurons h_i(·) for i ∈ [m]; inputs from a set Z (usually intermediate outputs of attention layers); an update multiplier λ_sim.
1: Initialize running similarity sim^(0)(h_i, h_j) = 0 for i, j ∈ [m], running inner products C^(0)_{i,j} = 0, and running output vector norms Q^(0)_i = 0 for i ∈ [m].
2: while still training do
3:   Sample an input X ∈ Z and compute the output vector h_j(X) for each neuron j ∈ [m].
4:   Update C^(t+1)_{i,j} ← (1 - λ_sim) C^(t)_{i,j} + λ_sim · h_i(X)^⊤ h_j(X), ∀i, j ∈ [m].
5:   Update Q^(t+1)_i ← (1 - λ_sim) Q^(t)_i + λ_sim · ‖h_i(X)‖²_2, ∀i ∈ [m].
6:   Update similarity: sim^(t+1)(h_i, h_j) ← C^(t+1)_{i,j} / √(Q^(t+1)_i Q^(t+1)_j), ∀i, j ∈ [m].
7: end while

We now build on the regularization term R(S) of movement pruning to define a new regularization. Let N_left be the number of leftover neurons. For each group j ∈ [m] with corresponding score S_j, we define

U_j = (1 / N_left) Σ_{i∈[m], i≠j} sim(h_j, h_i)

and multiply U_j into the original terms of R(S) to obtain

R_sim(S) = λ_mvp Σ_j U_j · sigmoid(S_j)    (4)

Global Top_v for Soft-Like Movement. Hard movement removes the same amount of weights per layer independently. Global Top_v instead applies the Top_v function jointly to the set of all mask scores in the network. Global Top_v was originally explored for movement pruning (Sanh et al., 2020) and found to perform similarly. We find that, when used in conjunction with uniqueness regularization, global outperforms local. Global Top_v intuitively allows more flexibility when pruning: when pruning locally, one must choose the least common pruning percentage; if one layer requires 50% of its neurons before severe degradation, all layers must keep 50%. Global comparison removes this limitation in a manner similar to soft movementfoot_5.
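Algorithm 1 and the regularizer of Equation 4 can be sketched together in NumPy; class and function names are ours, and the tiny denominator offset is added only to avoid division by zero at initialization:

```python
import numpy as np

class RunningCosineSimilarity:
    """EMA estimate of pairwise neuron cosine similarity (Algorithm 1):
    track running inner products C and squared norms Q, then
    sim = C / sqrt(Q_i * Q_j)."""
    def __init__(self, m, lam=0.99):
        self.C = np.zeros((m, m))   # running inner products
        self.Q = np.zeros(m)        # running squared output norms
        self.lam = lam              # update multiplier lambda_sim

    def update(self, H):
        """H: (num_tokens, m) batch of intermediate outputs h_j(X)."""
        self.C = (1 - self.lam) * self.C + self.lam * (H.T @ H)
        self.Q = (1 - self.lam) * self.Q + self.lam * (H ** 2).sum(axis=0)

    def sim(self):
        denom = np.sqrt(np.outer(self.Q, self.Q)) + 1e-12
        return self.C / denom

def r_sim(S, sim_matrix, n_left, lam_mvp):
    """R_sim(S) = lam_mvp * sum_j U_j * sigmoid(S_j), with U_j the summed
    similarity of neuron j to all others, divided by n_left (Eq. 4)."""
    U = (sim_matrix.sum(axis=1) - np.diag(sim_matrix)) / n_left
    return lam_mvp * float((U / (1 + np.exp(-S))).sum())
```

Because U_j multiplies the score regularizer rather than the weights, highly similar neurons have their mask scores pushed down while the weights themselves are left untouched.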

5. RESULTS

In this section, we present results on three kinds of generative language modeling tasks: language modeling with Wikitext-103 (Merity et al., 2016), text-to-text generation and natural language understanding with SAMsum (Gliwa et al., 2019), and exact-match-measurable text-to-code generation with WikiSQL (Zhong et al., 2017). Details and hyperparameters are listed in Appendix F. When distilling, the teacher model is the finetuned version of the model. To ensure trends hold when scaling up, we also present one experiment with GPT-Neo-1.3b. All pruning amounts are presented in terms of the final percentage leftover, i.e., 75% means 75% of neurons remain after pruning. For soft movement, the final prune percentage is shown in parentheses when it differs substantially from the target. In general, GUM outperforms Top_v by a margin similar to the difference between Top_v and gradual random pruning, with some exceptions. While small, we argue this gap shows the effectiveness of preserving neuron uniqueness alongside saliency.

Wikitext-103. Results on the Wikitext-103 dataset (Merity et al., 2016), one of the most popular datasets for causal language modeling, are shown in Tables 4 and 5. Because performance on Wikitext-103 is measured in perplexity (PPL), it is a highly consistent and discriminative dataset to prune on. We are never able to fully recover the original model performance after pruning, suggesting that any compression increases uncertainty.

WikiSQL. As opposed to the other datasets, WikiSQL (Zhong et al., 2017) contains hard ground-truth labels for comparison via Exact Match (EM). Accordingly, our best performance is achieved on WikiSQL, where GUM is able to remove up to 75% of neurons while maintaining performance on GPT-Neo. Results are shown in Tables 2 and 3. We also present results for GPT-Neo-1.3b, on WikiSQL only, in Table 6; results for this experiment follow a trend similar to the smaller models.

SAMsum. Results on SAMsum (Gliwa et al., 2019) are presented in Tables 7 and 8. Popular for encoder-decoder models, this dataset entails summarizing short instant-message conversations. Larger generative models have been explored for this task (Feng et al., 2021; Zhang et al., 2019), achieving competitive results. We use this dataset to test the natural language understanding and summarization skills of small models under pruning. As expected, we note poor baseline results relative to encoder-decoder models; however, pruning trends follow those of the other datasets, and GUM generally outperforms Top_v.

6. ADDITIONAL RELATED WORKS

General Domain Pruning. Neural net pruning was proposed years before the explosion of deep learning research (Janowsky, 1989; Mozer & Smolensky, 1988; Karnin, 1990) and is summarized in an early survey (Reed, 1993). Previous works have explored many approaches to pruning neural nets (Wen et al., 2016; Han et al., 2015b;a; Li et al., 2016). More recently, the lottery ticket hypothesis (Frankle & Carbin, 2018) proposed pruning at initialization instead, though there is a massive divergence of methods, and some claim the direction is not worthwhile (Liu et al., 2018; Blalock et al., 2020). Regardless, many strong techniques exist in modern incarnations across all kinds of architectures (Yang et al., 2016; Luo et al., 2017; He et al., 2018a).

B E2E NLG CHALLENGE RESULTS

We also tested pruning on the E2E NLG Challenge (Dušek et al., 2020) in Tables 12 and 13. A highly popular dataset, performance in this domain is measured by a variety of metrics that all assess output quality indirectly, as opposed to the direct comparison against ground truth in our other experiments. After many rounds of hyperparameter optimization, results for GPT-2-sm loosely follow the previously seen trends; however, GPT-Neo-125m results are highly inconsistent. For this model, removing even 90% of neurons can yield the best performance, and increased pruning does not monotonically affect performance, two concerning phenomena unseen in other datasets. We speculate these inconsistencies could stem from many causes, such as the open-endedness of the problem domain or too little training data. Measuring the performance of generative models is a non-trivial task, especially when model outputs are highly similar, as is the case under pruning. For these reasons, we do not experiment further on this dataset and leave it out of our main analysis. However, we include this experiment to show that pruning results can, in some cases, be inconsistent for language models. Further exploration into this area is required.

C UNIQUENESS REGULARIZATION PER LAYER

The goal of uniqueness regularization is to punish similarity between neurons, especially those with high similarity. Figure 3 shows that GUM is generally successful in removing neurons with high similarity across all layers. Figure 4 shows the prune percentages per layer for one sample training run. From this test, later layers are clearly prioritized, with a large emphasis on the last layer. While the exact layers pruned more or less will vary with noise, model, and dataset, we observe this trend to generally hold true.

E UNIQUENESS AND SENSITIVITY GRAPHING

To measure sensitivity for a model, we measure the global sum of sensitivity for all neurons {h_i(·)}_{i∈[m]} in the feedforward layers l ∈ L on the training dataset via (taking the absolute value to account for sign):

Σ_l Σ_i |h_i(X) · ∂L/∂h_i(X)|

To measure uniqueness for a model, we measure the cosine similarity of each neuron with every other neuron as in Equation 3 over the entire dataset, measuring each pair only once; this can be computed using the running cosine similarity without a decay multiplier. We then measure the percentage of neurons with at least one similarity above 0.8 to another neuron (i.e., neurons that nearly match each other) across the entire network. Neither sensitivity nor uniqueness is useful on its own; they are useful relative to other models, so we divide both metrics by the value obtained for the fully finetuned version of the model. Overall, uniqueness values for Wikitext-103 and WikiSQL were quite different: neurons on Wikitext-103 seem to already be highly unique, with some specific layers containing a large amount of redundancy, while WikiSQL has redundancy throughout the entire network.
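Given a full pairwise similarity matrix, the uniqueness metric above reduces to a thresholded count; a minimal sketch with a function name and default threshold of our choosing (0.8, matching the text):

```python
import numpy as np

def uniqueness_fraction(sim_matrix, threshold=0.8):
    """Fraction of neurons that are 'unique': no absolute cosine
    similarity above `threshold` with any other neuron."""
    s = np.abs(sim_matrix.copy())
    np.fill_diagonal(s, 0.0)                 # ignore self-similarity
    redundant = (s > threshold).any(axis=1)  # has a near-duplicate
    return 1.0 - float(redundant.mean())
```

As in the text, this raw fraction is then divided by the value obtained for the fully finetuned model to make comparisons across methods meaningful.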

F TRAINING HYPERPARAMETERS

All training hyperparameters are provided. Only one random seed was attempted per training run. We note that different pruning amounts may require different hyperparameters, and performance could therefore be better; however, searching every possible combination would be far too expensive, so the same hyperparameters are used across pruning percentages. Only the best combination for each model is listed, and each model's hyperparameters were tuned manually.



Footnotes

foot_0: All code publicly available at (removed for peer-review).
foot_1: Gradients are calculated straight-through to the mask scores, which are otherwise undefined (Bengio et al., 2013).
foot_2: Soft movement is the best method to our knowledge at time of writing. Code is available at https://github.com/huggingface/nn_pruning. In this code, mask scores are added directly to the optimizer and are affected by the optimizer algorithm and other hyperparameters. We use this code for a fair comparison between architectures, but manually updating the mask according to the definition might help (Zhang et al., 2022).
foot_3: These are two known concepts in the literature, but they have not previously been combined into one pruning method.
foot_4: Solving for linear combinations of neurons during training is prohibitively expensive, so we consider cosine similarity as a "first-order" proxy.
foot_5: Appendix D shows an example pruning distribution for one pruned network.



Figure 1: Sensitivity and Uniqueness measured on the training set for GPT-Neo-125m. The vertical axis is defined as the ratio of the corresponding metric between the pruned model and a baseline model (non-pruned and fully fine-tuned), with a maximum of 1x. We use these graphs to analyze and compare the performance of different pruning methods. Details of the measurements are given in Appendix E.

Figure 2: Sensitivity and Uniqueness measured on the training set for GPT-2-sm. The vertical axis is defined as the ratio of the corresponding metric between the pruned model and a baseline model (non-pruned and fully fine-tuned), with a maximum of 1x. Details of the measurements are given in Appendix E.

Figure 4: The percentage leftover for layers 1-12 after Global Top v pruning, using GUM training on WikiSQL with GPT-Neo-125m. Total leftover neurons is exactly 25% of all neurons.

Figures 5 and 6 show non-re-scaled results for the previous graphs. These graphs show uniqueness as a percentage of all neurons (i.e., 50% of neurons are unique) and sensitivity as a raw value.

Figure 5: Sensitivity and Uniqueness measured without re-scaling, for GPT-2-sm.

Comparison of pruning methods used. R(S) is defined in Section 2 and R_sim(S) is described in Eq. 4.

Performance in Acc_lf on the validation set for decreasing amounts leftover on WikiSQL. GUM outperforms the other methods, soft movement struggles to match the other methods, and gradual random performs nearly as well as Top_v. * indicates 1-3% excess leftover neurons left unpruned.


GPT-Neo-125m: Performance in perplexity (PPL) on the validation set for decreasing amounts leftover on Wikitext-103. Distillation generally hurts performance across all methods. * indicates 1-3% excess leftover neurons left unpruned.

Performance in perplexity (PPL) on the validation set for decreasing amount leftover on Wikitext-103. * indicates having 1-3% excess leftover neurons unpruned.

GPT-Neo-1.3B: Performance in Acc lf for decreasing amount leftover on WikiSQL.

SAMsum Training Runtime in Hours. GPT-Neo-125m and GPT-2-sm were run on 8xV100 GPUs. All results are averaged over all pruning runs, which are comparable given that neuron removal occurs after training.

: Testing results on the E2E NLG Challenge. Higher is better for all metrics. Even at 10% leftover, performance is similar to the baseline, for both Top v and GUM.

Testing results on the E2E NLG Challenge. Higher is better for all metrics. In general, more pruning results in worse performance, and GUM outperforms Top v .

For each layer, this graph shows the percentage of neurons with at least one similarity in each range. Similarity is defined as the absolute value of cosine similarity over the entire validation dataset, increasing from 0 to 1. Top_v and GUM are compared, trained on WikiSQL with GPT-Neo-125m. Total leftover neurons is exactly 25% of all neurons.

D GLOBAL TOP_v REMOVAL PER LAYER

A natural question arises with Global Top_v pruning: what is the final prune percentage per layer?


Li et al. (2022) find ways to prune and finetune these models on downstream data. Building on automated gradual pruning (Zhu & Gupta, 2017) and learned threshold pruning (Azarian et al., 2020), movement pruning (Sanh et al., 2020) and subsequently block movement (Lagunas et al., 2021) have become highly popular fine-pruning methods, combining finetuning with pruning for overall best performance. Since then, many works have attempted to improve on movement (Yao et al., 2021; Zhang et al., 2022; Xia et al., 2022; Kwon et al., 2022). As previously mentioned, however, we are unable to find any comparable works systematically exploring structured pruning for decoder-only models.

7. CONCLUSION & FUTURE WORK

In this paper, we have performed an evaluation of structured pruning on generative language models, finding that existing methods improve over random pruning less than expected. In addition, we have proposed a framework for analyzing pruning methods in terms of uniqueness and saliency, two important criteria for preserving model quality and diversity. We have presented GUM, a novel method for structured pruning of generative models built on uniqueness regularization and global Top_v pruning. Our method can be applied to the MLP layers of various generative models, but many open questions and challenges remain for pruning other components, such as attention heads or MoE modules. We also acknowledge the limitations of our method, which can reduce saliency and performance, suggesting possible directions for improving uniqueness pruning. Our work is an initial step towards understanding and improving structured pruning of generative models.

A COMPUTATIONAL RUNTIME COMPARISON

The training runtimes of all pruning methods are compared in Tables 9, 10, and 11. Across all experiments, soft movement has a greatly increased runtime compared to the other pruning methods. This is due to the over-pruning problem described previously: if soft movement prunes too many neurons, it must fall back to hard movement, and reaching a specific pruning percentage in this way results in significantly more computation. GUM is slower than hard movement, but we find the difference to be minimal. Compared to no pruning, all pruning methods add a significant amount of time to training, especially when combined with distillation; the additional runtime incurred by GUM is small in comparison.

Model

We explored hyperparameters by starting with known values from the literature, then performing grid searches on relevant pruning values (e.g., distillation temperature, or λ_gum). L1 regularization was used on the mask scores for all models. The learning rate decayed linearly to 0 for all models. LR means Learning Rate, WD means Weight Decay, and LS means Label Smoothing. When distillation was used, the teacher was trained using the same hyperparameters (sans pruning arguments).

Table 14: Shared hyperparameters for all runs.
Adam β1 | Adam β2 | Adam ϵ | Batch | LR | Warmup Percent | GUM λ_c
0.9 | 0.999 | 1e-4 | 8 | | 10% | 0.99

WikiSQL: A max token length of 512 was used. Strings were not converted to lowercase. Special tokens were added by the tokenizer. GPT-Neo-125m used a mask LR of 1e-2 for hard movement/GUM and 1e1 for soft movement; GPT-2-sm likewise used a mask LR of 1e-2 for hard movement/GUM and 1e1 for soft movement. Soft movement required a large mask learning rate, as it would otherwise not converge; other combinations with more regularization were attempted.

SAMsum: A max token length of 1024 was used. Strings were not converted to lowercase, and whitespace was not stripped. More than 3 or 4 epochs starts to result in overtraining for both models, so both were limited accordingly. For GPT-Neo-125m, both methods used a mask LR of 1e-2. For GPT-2-sm, GUM required a mask LR of 1e-3, while Top_v used a mask LR of 1e-2. For both models, λ_gum must be 1e1 without distillation and 1e2 with distillation, as too high a λ_gum results in poor non-distilled performance.

