MVP: MULTI-TASK SUPERVISED PRE-TRAINING FOR NATURAL LANGUAGE GENERATION

Abstract

Pre-trained language models (PLMs) have achieved remarkable success in natural language generation (NLG) tasks. To date, most NLG-oriented PLMs have been pre-trained in an unsupervised manner on large-scale general corpora. Meanwhile, a growing number of models pre-trained with labeled data (i.e., "supervised pre-training") have shown superior performance to unsupervised pre-trained models. Motivated by the success of supervised pre-training, we propose Multi-task superVised Pre-training (MVP) for natural language generation. We collect a large-scale natural language generation corpus, MVPCorpus, from 77 datasets over 11 diverse NLG tasks. We then unify these examples into a general text-to-text format to pre-train the text generation model MVP in a supervised manner. For each task, we further pre-train task-specific soft prompts to stimulate the model's capacity to perform that task. Extensive experiments demonstrate the effectiveness and generalizability of our MVP model on a wide range of NLG tasks: it achieves state-of-the-art performance on 13 out of 17 datasets.

1. INTRODUCTION

Natural language generation (NLG, also known as text generation) is a crucial capability of language intelligence that aims to generate human-like text on demand (Garbacea & Mei, 2020). Since the emergence of the pre-training and fine-tuning paradigm, pre-trained language models (PLMs) have dominated the mainstream approaches to NLG tasks (Lewis et al., 2020; Brown et al., 2020). Using a large-scale general corpus, the majority of PLMs are pre-trained in an unsupervised (self-supervised) manner, leveraging intrinsic data correlations as supervision signals. However, unsupervised pre-training is likely to incorporate noise that hurts performance on downstream tasks (Feng et al., 2022), and it also leads to a slower rate of acquiring knowledge (Zhang et al., 2021). Meanwhile, more and more large-scale labeled datasets have become easily accessible (Deng et al., 2009; Liu et al., 2020). There is growing evidence that pre-training with labeled data can further improve the performance of PLMs, both in computer vision (He et al., 2016; Dosovitskiy et al., 2021) and in natural language processing (Lin et al., 2020b; Su et al., 2022). These promising developments motivate us to pre-train text generation models with labeled data, an approach called "supervised pre-training" (Feng et al., 2022). Existing work has shown that supervised pre-training can explicitly learn task-specific characteristics and alleviate the discrepancy between unsupervised pre-training and supervised fine-tuning (Sanh et al., 2022; Lin et al., 2020b). Furthermore, most NLG systems are trained in a supervised way, requiring supervision signals to learn the input-to-output transformation. For example, dialogue systems learn to generate appropriate responses based on historical utterances, and text summarization systems learn to extract essential information from long documents according to human-written summaries.
Therefore, we suspect that supervised pre-training is inherently better suited to NLG-oriented PLMs, since it can provide task-related instructions early in the pre-training stage rather than only in a later fine-tuning stage. Inspired by the recent success of supervised pre-training, we propose Multi-task superVised Pre-training (MVP) for natural language generation, which leverages a variety of labeled text generation datasets. Specifically, we collect a large-scale labeled corpus, MVPCorpus, consisting of 77 datasets over 11 text generation tasks. Since recent research shows that large-scale multi-task pre-training (Aribandi et al., 2022) is the key to generalizing to new tasks for large PLMs, we combine these labeled datasets for multi-task pre-training. Existing popular works, as shown in Table 1, mainly
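The unification of heterogeneous labeled examples into a single text-to-text format, as described above, can be sketched as follows. This is a minimal illustration, not the paper's actual preprocessing code: the function name `to_text_to_text`, the task names, and the instruction strings (e.g., "Summarize: ") are hypothetical placeholders for whatever task formulation MVP actually uses.

```python
# Minimal sketch: flatten labeled examples from different NLG tasks into a
# shared (input_text, target_text) format so one model can be pre-trained on
# all of them jointly. Instruction strings here are illustrative assumptions.

def to_text_to_text(task: str, example: dict) -> dict:
    """Convert a labeled example into a unified text-to-text pair."""
    instructions = {
        "summarization": "Summarize: ",
        "dialogue": "Given the dialog, respond: ",
        "data-to-text": "Describe the following data: ",
    }
    # Fall back to a generic task prefix for tasks not listed above.
    prefix = instructions.get(task, f"{task}: ")
    return {
        "input_text": prefix + example["source"],
        "target_text": example["target"],
    }

# Example: a summarization pair becomes a prefixed input and a plain target.
sample = {"source": "Long news article ...", "target": "Short summary."}
unified = to_text_to_text("summarization", sample)
```

After this conversion, examples from all tasks share one schema, so they can simply be concatenated and shuffled into a single multi-task pre-training stream.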

