PROGRESSIVE PROMPTS: CONTINUAL LEARNING FOR LANGUAGE MODELS

Abstract

We introduce Progressive Prompts, a simple and efficient approach for continual learning in language models. Our method allows forward transfer and resists catastrophic forgetting, without relying on data replay or a large number of task-specific parameters. Progressive Prompts learns a new soft prompt for each task and sequentially concatenates it with the previously learned prompts, while keeping the base model frozen. Experiments on standard continual learning benchmarks show that our approach outperforms state-of-the-art methods, with an improvement of >20% in average test accuracy over the previous best-performing method on the T5 model. We also explore a more challenging continual learning setup with longer sequences of tasks and show that Progressive Prompts significantly outperforms prior methods.

1. INTRODUCTION

Learning a long sequence of tasks while gaining experience and avoiding forgetting remains a key feature of human-level intelligence. Although pretrained language models have largely succeeded in learning single tasks, their performance degrades in scenarios where multiple tasks are encountered sequentially, a setting known as continual learning (CL) (de Masson D'Autume et al., 2019; Huang et al., 2021). Two major challenges arise in CL: (1) avoiding catastrophic forgetting, i.e., loss of the knowledge acquired from previous tasks after learning new ones (McCloskey & Cohen, 1989; Ratcliff, 1990), and (2) allowing forward transfer, i.e., leveraging the knowledge from past tasks for efficient learning of new tasks.

Typical CL approaches for language models train a single model on all tasks, which ensures forward transfer but also leads to forgetting. These methods use data replay or add regularization constraints (Huang et al., 2021; de Masson D'Autume et al., 2019; Sun et al., 2019), but they still suffer from forgetting due to inevitable changes in parameters shared between tasks. Other approaches, such as progressive networks (Rusu et al., 2016), can eliminate catastrophic forgetting entirely while supporting forward transfer, but are computationally expensive because they add a new copy of the model for each task. This is especially intractable for large-scale language models with billions of parameters, which have become standard in the NLP field (Zhang et al., 2022).

In this paper, we introduce Progressive Prompts, a novel CL approach for language models that supports forward transfer without forgetting. Our method is inspired by progressive networks, but is significantly more memory-efficient because it learns only a fixed number of soft tokens, i.e. a prompt, for each new task.
Learning a prompt to adapt a language model to a single downstream task was introduced in prompt tuning (Lester et al., 2021), and was shown to match the performance of full model finetuning while training <0.01% of the parameters. In Progressive Prompts, we learn a separate prompt for each incoming task and sequentially concatenate it with previously learned prompts. Importantly, we share input tokens across all tasks and progressively prepend new prompts while keeping previous prompts frozen (see Figure 1). Our method can: 1) alleviate catastrophic forgetting by preserving the knowledge acquired in previous prompts, and 2) transfer knowledge to future tasks by sequentially learning new prompts given previous ones.

Figure 1: Illustrating our proposed method, Progressive Prompts, and contrasting it with a simple adaptation of progressive networks using prompt tuning. In the simple adaptation of progressive networks, we learn a separate prompt and repeat the frozen input embeddings for each new task; this setup requires repeating input tokens for each task. In Progressive Prompts, we use the same input and progressively append a new prompt for each new task. Prior task prompts are not modified by the addition of new prompts.

We also introduce a new technique for prompt embedding reparameterization (Li & Liang, 2021). We show that by passing the prompt embeddings through a residual MLP we can stabilize prompt tuning and improve its performance. We run extensive experiments on standard CL benchmarks for text classification, and show that Progressive Prompts outperforms state-of-the-art approaches on both BERT and T5 architectures (Devlin et al., 2018; Raffel et al., 2020). We show over 20% improvement over the current SOTA for the T5 model (Qin & Joty, 2021). Furthermore, we run experiments on a more challenging CL setup with longer task sequences, and show that our method outperforms prior approaches for both T5 and BERT architectures.
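The residual reparameterization described above can be sketched as follows. This is an illustrative implementation, not the authors' code: the class name, hidden dimension, and choice of ReLU are assumptions; the key idea is only that the prompt embeddings are passed through an MLP with a skip connection before being fed to the model.

```python
import torch

class ResidualReparam(torch.nn.Module):
    """Sketch of residual prompt reparameterization (illustrative names):
    prompt embeddings pass through a small MLP with a skip connection,
    which is reported to stabilize prompt tuning."""

    def __init__(self, embed_dim: int, hidden_dim: int = 128):
        super().__init__()
        self.mlp = torch.nn.Sequential(
            torch.nn.Linear(embed_dim, hidden_dim),
            torch.nn.ReLU(),
            torch.nn.Linear(hidden_dim, embed_dim),
        )

    def forward(self, prompt: torch.Tensor) -> torch.Tensor:
        # Residual connection: output = MLP(P) + P, preserving shape.
        return self.mlp(prompt) + prompt
```

The skip connection lets the optimizer fall back to the identity mapping when the MLP is unhelpful, which is the usual motivation for residual parameterizations.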
Our main contributions in this paper are as follows:
• We propose a novel CL approach, Progressive Prompts, that alleviates catastrophic forgetting and supports knowledge transfer to future tasks, all while learning <0.1% of the total parameters.
• Progressive Prompts is suitable for any transformer-based architecture. We show that it significantly outperforms prior SOTA methods on standard CL benchmarks for both BERT and T5 models.
• We propose a more challenging CL setup encompassing 15 text classification tasks and show that our method significantly outperforms prior methods.
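The core mechanism, one frozen prompt per past task plus one trainable prompt for the current task, prepended to shared input embeddings, can be sketched as below. This is a minimal illustration under assumed names and shapes, not the paper's released code; in practice the concatenated sequence would be fed to a frozen BERT or T5 encoder.

```python
import torch

class ProgressivePrompts(torch.nn.Module):
    """Sketch of Progressive Prompts (illustrative, not the authors' code):
    a new soft prompt is learned per task; prompts from earlier tasks are
    frozen and prepended to the shared input embeddings."""

    def __init__(self, embed_dim: int, prompt_len: int = 10):
        super().__init__()
        self.embed_dim = embed_dim
        self.prompt_len = prompt_len
        self.prompts = torch.nn.ParameterList()  # one prompt per task

    def start_new_task(self) -> None:
        # Freeze all previously learned prompts ...
        for p in self.prompts:
            p.requires_grad_(False)
        # ... and add a fresh trainable prompt for the new task.
        self.prompts.append(
            torch.nn.Parameter(torch.randn(self.prompt_len, self.embed_dim) * 0.02)
        )

    def forward(self, input_embeds: torch.Tensor) -> torch.Tensor:
        # input_embeds: (batch, seq_len, embed_dim); the base model itself
        # stays frozen and is omitted from this sketch.
        batch = input_embeds.size(0)
        all_prompts = torch.cat(list(self.prompts), dim=0)        # (k*len, dim)
        all_prompts = all_prompts.unsqueeze(0).expand(batch, -1, -1)
        return torch.cat([all_prompts, input_embeds], dim=1)
```

Because earlier prompts receive no gradient updates, knowledge stored in them cannot be overwritten (no forgetting), while the new prompt is trained conditioned on them (forward transfer).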

2.1. FINETUNING

The predominant technique for adapting a language model to a downstream task T is finetuning, where all parameters Θ of the model (initialized from pre-trained weights) are updated during task adaptation (Devlin et al., 2018; Zhang et al., 2020). Consider a classification task T with input text x and output scalar label y, and let p_Θ denote the probability distribution over output classes parameterized by the weights Θ of the language model. In finetuning, all model parameters are updated to maximize the log-likelihood of the training data:

max_Θ Σ_{(x,y)∈T} log p_Θ(y | x)
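A single finetuning step under this objective can be sketched as follows. The model, optimizer, and dimensions are illustrative stand-ins (a linear classifier in place of a full language model); the point is only that every parameter in Θ receives a gradient update, and that maximizing log p_Θ(y | x) corresponds to minimizing cross-entropy.

```python
import torch

# Stand-in for a pretrained language model: all of its parameters (Θ)
# are trainable, which is what distinguishes finetuning from prompt tuning.
model = torch.nn.Linear(16, 3)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

def finetune_step(x: torch.Tensor, y: torch.Tensor) -> float:
    logits = model(x)                                     # unnormalized p_Θ(. | x)
    loss = torch.nn.functional.cross_entropy(logits, y)   # -log p_Θ(y | x)
    optimizer.zero_grad()
    loss.backward()   # gradients flow to ALL parameters Θ
    optimizer.step()
    return loss.item()
```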

