PARAMETER-EFFICIENT TRANSFER LEARNING WITH DIFF PRUNING

Abstract

While task-specific finetuning of deep networks pretrained with self-supervision has led to significant empirical advances in NLP, the large size of these networks makes the standard finetuning approach difficult to apply in multi-task, memory-constrained settings, as storing the full model parameters for each task becomes prohibitively expensive. We propose diff pruning as a simple approach to enable parameter-efficient transfer learning within the pretrain-finetune framework. This approach views finetuning as learning a task-specific "diff" vector that is applied on top of the pretrained parameter vector, which remains fixed and is shared across different tasks. The diff vector is adaptively pruned during training with a differentiable approximation to the L0-norm penalty to encourage sparsity. Diff pruning becomes parameter-efficient as the number of tasks increases, since it requires storing only the nonzero positions and weights of the diff vector for each task, while the cost of storing the shared pretrained model remains constant. We find that models finetuned with diff pruning can match the performance of fully finetuned baselines on the GLUE benchmark while modifying only 0.5% of the pretrained model's parameters per task.

1. INTRODUCTION

Task-specific finetuning of pretrained deep networks has become the dominant paradigm in contemporary NLP, achieving state-of-the-art results across a suite of natural language understanding tasks (Devlin et al., 2019; Liu et al., 2019c; Yang et al., 2019; Lan et al., 2020). While straightforward and empirically effective, this approach is difficult to scale to multi-task, memory-constrained settings (e.g. for on-device applications), as it requires shipping and storing a full set of model parameters for each task. Inasmuch as these models are learning generalizable, task-agnostic language representations through self-supervised pretraining, finetuning the entire model for each task is an especially inefficient use of model parameters. A popular approach to parameter-efficiency is to learn sparse models for each task, where a subset of the final model parameters are exactly zero (Gordon et al., 2020; Sajjad et al., 2020; Zhao et al., 2020; Sanh et al., 2020). Such approaches often face a steep sparsity/performance tradeoff, and a substantial portion of nonzero parameters (e.g. 10%-30%) is still typically required to match the performance of the dense counterparts. An alternative is to use multi-task learning or feature-based transfer for more parameter-efficient transfer learning with pretrained models (Liu et al., 2019b; Clark et al., 2019; Stickland & Murray, 2019; Reimers & Gurevych, 2019; Feng et al., 2020). These methods learn only a small number of additional parameters (e.g. a linear layer) on top of a shared model. However, multi-task learning generally requires access to all tasks during training to prevent catastrophic forgetting (French, 1999), while feature-based transfer learning (e.g. based on task-agnostic sentence representations) is typically outperformed by full finetuning (Howard & Ruder, 2018).
Adapters (Rebuffi et al., 2018) have recently emerged as a promising approach to parameter-efficient transfer learning within the pretrain-finetune paradigm (Houlsby et al., 2019; Pfeiffer et al., 2020a;b;c). Adapter layers are smaller, task-specific modules that are inserted between layers of a pretrained model, which remains fixed and is shared across tasks. These approaches do not require access to all tasks during training, making them attractive in settings where one hopes to obtain and share performant models as new tasks arrive in a stream. Houlsby et al. (2019) find that adapter layers trained on BERT can match the performance of fully finetuned BERT on the GLUE benchmark (Wang et al., 2019a) while only requiring 3.6% additional parameters (on average) per task. In this work, we consider a similar setting as adapters but propose a new diff pruning approach with the goal of even more parameter-efficient transfer learning. Diff pruning views finetuning as learning a task-specific difference vector that is applied on top of the pretrained parameter vector, which remains fixed and is shared across different tasks. In order to learn this vector, we reparameterize the task-specific model parameters as θ_task = θ_pretrained + δ_task, where the pretrained parameter vector θ_pretrained is fixed and the task-specific diff vector δ_task is finetuned. The diff vector is regularized with a differentiable approximation to the L0-norm penalty (Louizos et al., 2018) to encourage sparsity. This approach can become parameter-efficient as the number of tasks increases, as it only requires storing the nonzero positions and weights of the diff vector for each task. The cost of storing the shared pretrained model remains constant and is amortized across multiple tasks.
On the GLUE benchmark (Wang et al., 2019a), diff pruning can match the performance of the fully finetuned BERT baselines while finetuning only 0.5% of the pretrained parameters per task, making it a potential alternative to adapters for parameter-efficient transfer learning.
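To make the storage argument concrete, the following back-of-the-envelope sketch compares the parameters that must be stored for T tasks under full finetuning versus diff pruning. The constants (110M parameters per model, 0.5% nonzero diff entries, two stored numbers per nonzero entry) are illustrative assumptions for the sketch, not measurements.

```python
# Illustrative storage comparison (constants are assumptions, not measurements):
# full finetuning stores one dense parameter copy per task, while diff pruning
# stores the shared pretrained model once plus a sparse (index, value) diff
# per task.

def storage_millions(num_tasks, model_params=110e6, nonzero_frac=0.005):
    """Return (full_finetune, diff_prune) storage in millions of parameters."""
    full = num_tasks * model_params
    # Each nonzero diff entry needs a position and a weight (~2 numbers).
    diff = model_params + num_tasks * (2 * nonzero_frac * model_params)
    return full / 1e6, diff / 1e6

for t in (1, 10, 100):
    full, diff = storage_millions(t)
    print(f"{t:3d} tasks: full={full:,.0f}M  diff-pruned={diff:,.0f}M")
```

Under these assumptions the diff-pruned cost grows by about 1% of the model size per additional task, so the gap widens as tasks accumulate.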

2. BACKGROUND: TRANSFER LEARNING FOR NLP

The field of NLP has recently seen remarkable progress through transfer learning with a pretrain-and-finetune paradigm, which initializes a subset of the model parameters for all tasks from a pretrained model and then finetunes on a task-specific objective. Pretraining objectives include context prediction (Mikolov et al., 2013), autoencoding (Dai & Le, 2015), machine translation (McCann et al., 2017), and, more recently, variants of language modeling (Peters et al., 2018; Radford et al., 2018; Devlin et al., 2019) objectives. Here we consider applying transfer learning to multiple tasks. We consider a setting with a potentially unknown set of tasks, where each task τ ∈ T has an associated training set {(x_τ^(n), y_τ^(n))}_{n=1}^N.[1] For all tasks, the goal is to produce (possibly tied) model parameters θ_τ that minimize the empirical risk,

min_{θ_τ} (1/N) Σ_{n=1}^N L(f(x_τ^(n); θ_τ), y_τ^(n)) + λ R(θ_τ),

where f(·; θ) is a parameterized function over the input (e.g. a neural network), L(·, ·) is a loss function (e.g. cross-entropy), and R(·) is an optional regularizer with hyperparameter λ. This multi-task setting can use the pretrain-then-finetune approach by simply learning independent parameters for each task; however, the large size of pretrained models makes this approach exceedingly parameter-inefficient. For example, widely-adopted models such as BERT_BASE and BERT_LARGE have 110M and 340M parameters respectively, while their contemporaries such as T5 (Raffel et al., 2020), Megatron-LM (Shoeybi et al., 2019), and Turing-NLG (Rajbhandari et al., 2019) have parameter counts in the billions. Storing the fully finetuned models becomes difficult even for a moderate number of tasks.[2] A classic approach to tackling this parameter inefficiency (Caruana, 1997) is to train a single shared model (along with a task-specific output layer) against multiple tasks through joint training.
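As a concrete illustration of the objective above, the following sketch finetunes an independent copy of the parameters for one task, using a toy linear model f(x; θ) = θ·x with squared loss and an L2 regularizer R(θ) = ||θ||². The model, loss, and optimizer here are stand-ins chosen for brevity, not the networks or training setup used in this work.

```python
import numpy as np

# Toy instance of the empirical-risk objective: (1/N) Σ L(f(x; θ), y) + λ R(θ),
# with a linear model, squared loss, and L2 regularizer (all illustrative).

def empirical_risk(theta, X, y, lam=0.01):
    preds = X @ theta                       # f(x^(n); θ) for every example
    loss = np.mean((preds - y) ** 2)        # (1/N) Σ L(f(x^(n); θ), y^(n))
    return loss + lam * np.sum(theta ** 2)  # ... + λ R(θ)

def finetune(theta_pretrained, X, y, lam=0.01, lr=0.1, steps=200):
    """Independently finetune a copy of the pretrained parameters on one task."""
    theta = theta_pretrained.copy()
    N = len(y)
    for _ in range(steps):
        # Gradient of the regularized empirical risk above.
        grad = 2 * X.T @ (X @ theta - y) / N + 2 * lam * theta
        theta -= lr * grad
    return theta
```

Running `finetune` once per task yields the independent per-task parameters θ_τ; the point of the surrounding discussion is that storing one such full copy per task does not scale.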
However, the usual formulation of multi-task learning requires the set of tasks T to be known in advance in order to prevent catastrophic forgetting (French, 1999),[3] making it unsuitable for applications in which the set of tasks is unknown (e.g. when tasks arrive in a stream).

3. DIFF PRUNING

Diff pruning formulates task-specific finetuning as learning a diff vector δ_τ that is added to the pretrained model parameters θ_pretrained. We first reparameterize the task-specific model parameters,

θ_τ = θ_pretrained + δ_τ,

[1] Our setup therefore differs from the classic multi-task setting, which usually assumes that the set of tasks is known in advance.
[2] An intriguing line of work suggests that large-scale language models can be used without finetuning for a variety of tasks if given the appropriate context (Radford et al., 2019; Brown et al., 2020). While interesting, these models generally underperform task-specific models and require billions of parameters, though recent work suggests that they can be made substantially smaller (Schick & Schütze, 2020).
[3] However, work on continual learning mitigates these issues to an extent (Shin et al., 2017; Lopez-Paz & Ranzato, 2017; Lee et al., 2017; Kirkpatrick et al., 2017; Parisi et al., 2018).
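The reparameterization θ_τ = θ_pretrained + δ_τ, combined with the hard-concrete relaxation of the L0 penalty (Louizos et al., 2018) mentioned earlier, can be sketched as follows. This is a minimal numpy sketch, not the training code: the gate constants (β, γ, ζ) follow common choices from the hard-concrete literature and are assumptions, and the diff is factored as δ_τ = z ⊙ w, a gate vector times a magnitude vector.

```python
import numpy as np

# Hard-concrete gates for diff pruning (illustrative constants).
BETA, GAMMA, ZETA = 2 / 3, -0.1, 1.1

def sample_gate(log_alpha, rng):
    """Sample a stretched, clipped hard-concrete gate z in [0, 1]."""
    u = rng.uniform(1e-6, 1 - 1e-6, size=log_alpha.shape)
    s = 1 / (1 + np.exp(-(np.log(u) - np.log(1 - u) + log_alpha) / BETA))
    return np.clip(s * (ZETA - GAMMA) + GAMMA, 0.0, 1.0)

def expected_l0(log_alpha):
    """Differentiable expected number of nonzero gates (the relaxed L0 penalty)."""
    return np.sum(1 / (1 + np.exp(-(log_alpha - BETA * np.log(-GAMMA / ZETA)))))

def task_parameters(theta_pretrained, w, log_alpha, rng):
    """θ_τ = θ_pretrained + δ_τ, with δ_τ = z ⊙ w (gate times magnitude)."""
    z = sample_gate(log_alpha, rng)
    return theta_pretrained + z * w
```

During training, `expected_l0(log_alpha)` would be added to the task loss to push gates toward exactly zero; at the end, only the positions with nonzero gates (and their weights) need to be stored per task.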

