MVP: MULTI-TASK SUPERVISED PRE-TRAINING FOR NATURAL LANGUAGE GENERATION

Abstract

Pre-trained language models (PLMs) have achieved remarkable success in natural language generation (NLG) tasks. To date, most NLG-oriented PLMs have been pre-trained in an unsupervised manner on large-scale general corpora. Meanwhile, an increasing number of models pre-trained with labeled data (i.e., "supervised pre-training") have shown superior performance compared to unsupervised pre-trained models. Motivated by the success of supervised pre-training, we propose Multi-task superVised Pre-training (MVP) for natural language generation. We collect a large-scale natural language generation corpus, MVPCorpus, from 77 datasets over 11 diverse NLG tasks. We then unify these examples into a general text-to-text format to pre-train the text generation model MVP in a supervised manner. For each task, we further pre-train task-specific soft prompts to stimulate the model's capacity to perform that task. Extensive experiments demonstrate the effectiveness and generalizability of our MVP model on a number of NLG tasks, achieving state-of-the-art performance on 13 out of 17 datasets.

1. INTRODUCTION

Natural language generation (NLG, also known as text generation) is a crucial capacity of language intelligence, aiming to generate human-like texts on demand (Garbacea & Mei, 2020). Since the emergence of the pre-training and fine-tuning paradigm, pre-trained language models (PLMs) have dominated the mainstream approaches to NLG tasks (Lewis et al., 2020; Brown et al., 2020). With a large-scale general corpus, the majority of PLMs are pre-trained in an unsupervised (self-supervised) manner, leveraging intrinsic data correlations as supervision signals. However, unsupervised pre-training is likely to incorporate noise that affects the performance of downstream tasks (Feng et al., 2022) and also leads to a slower rate of acquiring knowledge (Zhang et al., 2021). Meanwhile, more and more large-scale labeled datasets have become easily accessible (Deng et al., 2009; Liu et al., 2020). There is growing evidence that pre-training with labeled data can further improve the performance of PLMs, both in computer vision (He et al., 2016; Dosovitskiy et al., 2021) and natural language processing (Lin et al., 2020b; Su et al., 2022). These promising developments motivate us to pre-train text generation models with labeled data, which is called "supervised pre-training" (Feng et al., 2022). Existing work has shown that supervised pre-training can explicitly learn task-specific characteristics and alleviate the discrepancy between unsupervised pre-training and supervised fine-tuning (Sanh et al., 2022; Lin et al., 2020b). Furthermore, most NLG systems are trained in a supervised way, requiring supervision signals to learn the input-to-output transformation. For example, dialogue systems learn to generate appropriate responses based on historical utterances, and text summarization systems learn to extract essential information from long documents according to human-written summaries.
Therefore, we suspect that supervised pre-training is inherently better suited for NLG-oriented PLMs, since it can provide task-related instructions early in the pre-training stage instead of a later fine-tuning stage. Inspired by the recent success of supervised pre-training, we propose Multi-task superVised Pre-training (MVP) for natural language generation by leveraging a variety of labeled text generation datasets. Specifically, we collect a large-scale labeled corpus, MVPCorpus, consisting of 77 datasets over 11 text generation tasks. Since recent research shows that an extensive scale of multi-task pre-training (Aribandi et al., 2022) is the key to generalizing to new tasks for large PLMs, we combine these labeled datasets for multi-task pre-training. Existing popular works, as shown in Table 1, mainly focus on NLU tasks or use unsupervised pre-training.

To summarize, our main contributions center around the following research questions:

• How to train an NLG-oriented PLM in a supervised pre-training way? To prepare the supervised corpus, we collect the massive labeled MVPCorpus, consisting of 77 datasets over 11 NLG tasks across various domains and specific objectives. To the best of our knowledge, MVPCorpus is the largest collection of NLG datasets. We then formulate different NLG tasks in a general text-to-text form so that the supervised corpus can be used in a unified way to pre-train an NLG model. Our work presents a simple yet general approach to pre-training a more capable NLG model by leveraging various labeled NLG datasets.

• Can supervised pre-trained NLG models be both effective and general? Extensive experiments show that the supervised pre-trained MVP outperforms its unsupervised pre-trained counterpart BART in both full tuning (+7.0% on average) and parameter-efficient tuning (+4.3% on average) settings. Our MVP model achieves state-of-the-art performance on 13 out of 17 datasets.
Furthermore, experiments on unseen NLG and NLU tasks demonstrate that our supervised MVP model has a strong generalization ability to unseen tasks. To facilitate reproducing and reusing our work, we release the MVPCorpus collection, the models (e.g., MVP, task-specific prompts, and multi-task variants), and the code for pre-training and fine-tuning at the link: https://anonymous.4open.science/r/ICLR-2023-Paper3518/.
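The text-to-text unification described above can be sketched as follows. Note that the instruction templates and the function name here are illustrative assumptions for exposition, not MVP's exact prompt format.

```python
# Hypothetical sketch: casting heterogeneous NLG examples into one
# text-to-text format so a single encoder-decoder model can train on all tasks.

# Illustrative task instructions (the real MVP templates may differ).
TASK_INSTRUCTIONS = {
    "summarization": "Summarize: ",
    "dialogue": "Given the dialog, generate the response: ",
    "data-to-text": "Describe the following structured data: ",
}

def to_text_to_text(task, source, target):
    """Return an (input_text, output_text) pair for seq2seq training."""
    return TASK_INSTRUCTIONS[task] + source, target

pair = to_text_to_text(
    "summarization",
    "The quick brown fox jumps over the lazy dog. It then runs away.",
    "A fox jumps over a dog.",
)
# The input begins with the task instruction, followed by the raw source text;
# the target is left unchanged.
print(pair[0])
```

Under this formulation, every dataset in MVPCorpus contributes ordinary (input, output) string pairs, so multi-task pre-training reduces to standard sequence-to-sequence training over the pooled examples.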

2. RELATED WORK

Pre-trained Language Models. Pre-trained language models have achieved exceptional success in a wide range of tasks, and the majority of them are pre-trained in an unsupervised manner (Brown et al., 2020; Devlin et al., 2019; Lewis et al., 2020; Raffel et al., 2020). For example, with large-scale plain texts as the unsupervised pre-training corpus, GPT-3 (Brown et al., 2020) employs language modeling as the pre-training task, i.e., predicting the next token conditioned on previous tokens, while BART (Lewis et al., 2020) learns to recover the original text from corrupted text that has been altered by arbitrary noise transformations. GPT-3 and BART use 570GB and 160GB of unlabeled text as their pre-training corpora, respectively. Meanwhile, the computer vision community has benefited greatly from the labeled dataset ImageNet (Deng et al., 2009). Influential models, such as ResNet (He et al., 2016) and ViT (Dosovitskiy et al., 2021), leverage ImageNet for pre-training. Inspired by the success of pre-training with labeled data, machine translation researchers have explored supervised pre-training (McCann et al., 2017; Lin et al., 2020b). Lin et al. (2020b) pre-train a translation model, mRASP, with parallel data in multiple languages. Despite using much less pre-training data, mRASP still achieves better performance than translation models pre-trained in an unsupervised manner (Lample & Conneau, 2019; Liu et al., 2020). In this paper, we propose to pre-train a universal NLG model in a supervised manner with collections of labeled datasets (23GB).
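The denoising idea behind BART can be illustrated with a toy corruption function. This is a deliberately simplified sketch: BART's actual procedure samples multiple spans of varying lengths and applies other noise transformations as well.

```python
import random

def text_infilling(tokens, mask_token="<mask>", span_len=2, seed=0):
    """Toy corruption: replace one contiguous span with a single mask token."""
    rng = random.Random(seed)
    start = rng.randrange(len(tokens) - span_len + 1)
    return tokens[:start] + [mask_token] + tokens[start + span_len:]

original = ["the", "cat", "sat", "on", "the", "mat"]
corrupted = text_infilling(original)
# The denoising objective trains the model to reconstruct `original`
# given `corrupted` as input.
print(corrupted)
```

The supervision signal here is manufactured from the text itself, which is precisely what distinguishes this unsupervised setup from the labeled input-output pairs used in supervised pre-training.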



Table 1: Representative PLMs for NLG and NLU tasks using (un)supervised pre-training. We present a more detailed comparison and discussion about supervised pre-training in Section 6.

Existing popular works either focus on NLU tasks (Sanh et al., 2022; Aribandi et al., 2022) or use unsupervised pre-training (Lewis et al., 2020; Raffel et al., 2020), with no consideration of supervised pre-training on NLG tasks. To fill this gap, we explore supervised pre-training and multi-task learning to derive both effective and general NLG models. To develop our approach, we adopt a Transformer-based (Vaswani et al., 2017) sequence-to-sequence model as the pre-training backbone. In multi-task training, different tasks may "neutralize" the ability learned through other tasks (He & Choi, 2021). To mitigate this potential issue, we propose to learn task-specific prompts based on the MVP model, following the structure of prefix-tuning (Li & Liang, 2021). Task-specific pre-training enables prompts to "store" specialized knowledge for each corresponding task. Integrating MVP with task-specific prompts can further stimulate the model's capacity to perform specific tasks.
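The prefix-tuning structure mentioned above can be illustrated with a minimal single-head attention sketch. This is a pure-Python toy under simplifying assumptions (one query, one layer, tiny dimensions), not the actual implementation: trainable key/value vectors are prepended to the attention inputs at each layer, while the backbone parameters stay frozen.

```python
import math

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def attend(query, keys, values):
    """One attention step: the query mixes the value vectors by key similarity."""
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in keys]
    weights = softmax(scores)
    return [sum(w * v[j] for w, v in zip(weights, values)) for j in range(d)]

# Keys/values produced by the frozen backbone for two input tokens.
keys = [[1.0, 0.0], [0.0, 1.0]]
values = [[1.0, 0.0], [0.0, 1.0]]

# Task-specific trainable prefix: extra key/value pairs prepended per layer;
# in prefix-tuning only these would receive gradients.
prefix_keys = [[1.0, 1.0]]
prefix_values = [[0.5, 0.5]]

query = [1.0, 0.0]
plain = attend(query, keys, values)
prefixed = attend(query, prefix_keys + keys, prefix_values + values)
# The prefix shifts the attention output without touching backbone weights.
print(plain, prefixed)
```

Because each task stores its specialized knowledge in its own small prefix, switching tasks amounts to swapping prefixes on top of the shared MVP backbone.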

