PARAMETER-EFFICIENT TRANSFER LEARNING WITH DIFF PRUNING

Abstract

While task-specific finetuning of deep networks pretrained with self-supervision has led to significant empirical advances in NLP, the large size of these models makes the standard finetuning approach difficult to apply to multi-task, memory-constrained settings, as storing the full model parameters for each task becomes prohibitively expensive. We propose diff pruning as a simple approach to enable parameter-efficient transfer learning within the pretrain-finetune framework. This approach views finetuning as learning a task-specific "diff" vector that is applied on top of the pretrained parameter vector, which remains fixed and is shared across different tasks. The diff vector is adaptively pruned during training with a differentiable approximation to the L0-norm penalty to encourage sparsity. Diff pruning becomes parameter-efficient as the number of tasks increases, as it requires storing only the nonzero positions and weights of the diff vector for each task, while the cost of storing the shared pretrained model remains constant. We find that models finetuned with diff pruning can match the performance of fully finetuned baselines on the GLUE benchmark while only modifying 0.5% of the pretrained model's parameters per task.
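The parameterization described above, task parameters = fixed pretrained vector plus a sparse diff vector gated by a hard-concrete relaxation of the L0 norm, can be sketched as follows. This is an illustrative NumPy sketch, not the paper's implementation: the function names, the toy dimensions, and the hard-concrete hyperparameters (beta, l, r, taken from the standard formulation of Louizos et al.'s L0 relaxation) are assumptions for exposition.

```python
import numpy as np

def hard_concrete_sample(log_alpha, beta=2.0 / 3.0, l=-0.1, r=1.1, seed=0):
    """Sample [0,1]-clipped gates from a stretched hard-concrete
    distribution, a standard differentiable relaxation of the L0 penalty."""
    rng = np.random.default_rng(seed)
    u = rng.uniform(1e-6, 1 - 1e-6, size=log_alpha.shape)
    # Concrete (Gumbel-sigmoid) sample, then stretch to (l, r) and clip.
    s = 1.0 / (1.0 + np.exp(-(np.log(u) - np.log(1 - u) + log_alpha) / beta))
    return np.clip(s * (r - l) + l, 0.0, 1.0)

def expected_l0(log_alpha, beta=2.0 / 3.0, l=-0.1, r=1.1):
    """Expected number of nonzero gates (the differentiable L0 surrogate)."""
    return (1.0 / (1.0 + np.exp(-(log_alpha - beta * np.log(-l / r))))).sum()

def diff_pruned_params(theta_pretrained, w, log_alpha):
    """Task parameters = shared pretrained vector + gated (sparse) diff."""
    z = hard_concrete_sample(log_alpha)
    delta = z * w                      # task-specific diff vector
    return theta_pretrained + delta, z

# Toy example: 8 shared parameters, gate logits pushed strongly open/closed.
theta = np.zeros(8)                    # stands in for the pretrained weights
w = np.ones(8)                         # stands in for the learned diff magnitudes
log_alpha = np.array([5.0, 5.0, -5.0, -5.0, 5.0, -5.0, -5.0, -5.0])
params, z = diff_pruned_params(theta, w, log_alpha)
```

At storage time only the nonzero positions and values of `delta` need to be saved per task, which is what makes the approach parameter-efficient as the number of tasks grows.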

1. INTRODUCTION

Task-specific finetuning of pretrained deep networks has become the dominant paradigm in contemporary NLP, achieving state-of-the-art results across a suite of natural language understanding tasks (Devlin et al., 2019; Liu et al., 2019c; Yang et al., 2019; Lan et al., 2020). While straightforward and empirically effective, this approach is difficult to scale to multi-task, memory-constrained settings (e.g. for on-device applications), as it requires shipping and storing a full set of model parameters for each task. Inasmuch as these models are learning generalizable, task-agnostic language representations through self-supervised pretraining, finetuning the entire model for each task is an especially inefficient use of model parameters. A popular approach to parameter-efficiency is to learn sparse models for each task where a subset of the final model parameters are exactly zero (Gordon et al., 2020; Sajjad et al., 2020; Zhao et al., 2020; Sanh et al., 2020). Such approaches often face a steep sparsity/performance tradeoff, and a substantial portion of nonzero parameters (e.g. 10%-30%) is still typically required to match the performance of the dense counterparts. An alternative is to use multi-task learning or feature-based transfer for more parameter-efficient transfer learning with pretrained models (Liu et al., 2019b; Clark et al., 2019; Stickland & Murray, 2019; Reimers & Gurevych, 2019; Feng et al., 2020). These methods learn only a small number of additional parameters (e.g. a linear layer) on top of a shared model. However, multi-task learning generally requires access to all tasks during training to prevent catastrophic forgetting (French, 1999), while feature-based transfer learning (e.g. based on task-agnostic sentence representations) is typically outperformed by full finetuning (Howard & Ruder, 2018).
Adapters (Rebuffi et al., 2018) have recently emerged as a promising approach to parameter-efficient transfer learning within the pretrain-finetune paradigm (Houlsby et al., 2019; Pfeiffer et al., 2020a;b;c). Adapter layers are smaller, task-specific modules that are inserted between layers of a pretrained model, which remains fixed and is shared across tasks. These approaches do not require access to all tasks during training, making them attractive in settings where one hopes to obtain and share performant models as new tasks arrive in a stream. Houlsby et al. (2019) find that adapter layers trained on BERT can match the performance of fully finetuned BERT on the GLUE benchmark (Wang et al., 2019a) while only requiring 3.6% additional parameters (on average) per task.
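The bottleneck structure of a Houlsby-style adapter can be sketched as follows. This is a minimal NumPy illustration under stated assumptions: it uses ReLU where the original work uses GELU, omits layer normalization, and the names, initialization, and toy dimensions (768 hidden, 64 bottleneck) are chosen here for exposition rather than taken from any released implementation.

```python
import numpy as np

def adapter_layer(h, W_down, W_up):
    """Bottleneck adapter sketch: down-project the hidden state to a small
    dimension, apply a nonlinearity, project back up, and add a residual
    connection. Only W_down and W_up are task-specific; the surrounding
    pretrained transformer layers stay frozen and shared across tasks."""
    z = np.maximum(0.0, h @ W_down)   # down-projection + ReLU
    return h + z @ W_up               # up-projection + residual

d_model, d_bottleneck = 768, 64
rng = np.random.default_rng(0)
h = rng.standard_normal((2, d_model))              # two token hidden states
W_down = 0.01 * rng.standard_normal((d_model, d_bottleneck))
W_up = np.zeros((d_bottleneck, d_model))           # near-identity init:
out = adapter_layer(h, W_down, W_up)               # adapter starts as a no-op
```

With this shape, each adapter adds roughly 2 * d_model * d_bottleneck parameters per insertion point, which is small relative to the d_model * d_model weight matrices of the frozen transformer layers, the source of the few-percent-per-task overhead reported by Houlsby et al. (2019).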

