MVP: MULTI-TASK SUPERVISED PRE-TRAINING FOR NATURAL LANGUAGE GENERATION

Abstract

Pre-trained language models (PLMs) have achieved remarkable success in natural language generation (NLG) tasks. To date, most NLG-oriented PLMs have been pre-trained in an unsupervised manner on large-scale general corpora. Meanwhile, an increasing number of models pre-trained with labeled data (i.e., "supervised pre-training") have shown superior performance to unsupervised pre-trained models. Motivated by the success of supervised pre-training, we propose Multi-task superVised Pre-training (MVP) for natural language generation. We collect a large-scale natural language generation corpus, MVPCorpus, from 77 datasets over 11 diverse NLG tasks, and unify these examples into a general text-to-text format to pre-train the text generation model MVP in a supervised manner. For each task, we further pre-train specific soft prompts to stimulate the model's capacity for that task. Extensive experiments demonstrate the effectiveness and generalizability of our MVP model on a wide range of NLG tasks, where it achieves state-of-the-art performance on 13 out of 17 datasets.

1. INTRODUCTION

Natural language generation (NLG, also known as text generation) is a crucial capability for language intelligence, aiming to generate human-like texts on demand (Garbacea & Mei, 2020). Since the emergence of the pre-training and fine-tuning paradigm, pre-trained language models (PLMs) have dominated the mainstream approaches to NLG tasks (Lewis et al., 2020; Brown et al., 2020). With a large-scale general corpus, the majority of PLMs are pre-trained in an unsupervised (self-supervised) manner by leveraging intrinsic data correlations as supervision signals. However, unsupervised pre-training is likely to incorporate noise that affects the performance of downstream tasks (Feng et al., 2022) and also leads to a slower rate of acquiring knowledge (Zhang et al., 2021). Meanwhile, more and more large-scale labeled datasets have become easily accessible (Deng et al., 2009; Liu et al., 2020). There is growing evidence that pre-training with labeled data can further improve the performance of PLMs, both in computer vision (He et al., 2016; Dosovitskiy et al., 2021) and natural language processing (Lin et al., 2020b; Su et al., 2022). These promising developments motivate us to consider pre-training text generation models with labeled data, which is called "supervised pre-training" (Feng et al., 2022). Existing work has shown that supervised pre-training can explicitly learn task-specific characteristics and alleviate the discrepancy between unsupervised pre-training and supervised fine-tuning (Sanh et al., 2022; Lin et al., 2020b). Furthermore, most NLG systems are trained in a supervised way, requiring supervision signals to learn the input-to-output transformation. For example, dialogue systems learn to generate appropriate responses based on historical utterances, and text summarization systems learn to extract essential information from long documents according to human-written summaries.
Therefore, we suspect that supervised pre-training is better suited for NLG-oriented PLMs in essence, since it can provide task-related instructions early in the pre-training stage rather than in a later fine-tuning stage. Inspired by the recent success of supervised pre-training, we propose Multi-task superVised Pre-training (MVP) for natural language generation by leveraging a variety of labeled text generation datasets. Specifically, we collect a large-scale labeled corpus, MVPCorpus, consisting of 77 datasets over 11 text generation tasks. Since recent research shows that an extensive scale of multi-task pre-training (Aribandi et al., 2022) is the key to generalizing to new tasks for large PLMs, we combine these labeled datasets for multi-task pre-training. Existing popular works, as shown in Table 1, mainly focus on NLU tasks (Sanh et al., 2022; Aribandi et al., 2022) or use unsupervised pre-training (Lewis et al., 2020; Raffel et al., 2020), with no consideration of supervised pre-training on NLG tasks.

Table 1: Representative PLMs for NLG and NLU tasks using (un)supervised pre-training. We present a more detailed comparison and discussion about supervised pre-training in Section 6.

Settings | Supervised Pre-training | Unsupervised Pre-training
NLG | MVP (ours) | GPT-2, GPT-3, BART, T5, UniLM, MASS, PEGASUS
NLU | FLAN, T0, Muppet, ExT5 | BERT, RoBERTa, T5, UniLM, XLNet, ELECTRA

To fill this gap, we explore supervised pre-training and multi-task learning to derive both effective and general NLG models. To develop our approach, we adopt a Transformer-based (Vaswani et al., 2017) sequence-to-sequence model as the pre-training backbone. In multi-task training, different tasks may "neutralize" the abilities learned through other tasks (He & Choi, 2021). To mitigate this potential issue, we propose to learn task-specific prompts based on the MVP model, following the structure of prefix-tuning (Li & Liang, 2021). Task-specific pre-training enables prompts to "store" specialized knowledge for each corresponding task, and integrating MVP with task-specific prompts can further stimulate the model's capacity to perform specific tasks. To summarize, our main contributions center around the following research questions:

• How to train an NLG-oriented PLM in a supervised pre-training way? To prepare the supervised corpus, we collect the massive labeled MVPCorpus, consisting of 77 datasets over 11 NLG tasks across various domains and objectives. To the best of our knowledge, MVPCorpus is the largest collection of NLG datasets. We then formulate different NLG tasks in a general text-to-text form so that the supervised corpus can be used in a unified way to pre-train an NLG model. Our work presents a simple yet general approach for pre-training a more capable NLG model by leveraging various labeled NLG datasets.

• Can supervised pre-trained NLG models be both effective and general?
Extensive experiments show that the supervised pre-trained MVP outperforms its unsupervised pre-trained counterpart BART in both full tuning (+7.0% on average) and parameter-efficient tuning (+4.3% on average) settings. Our MVP model achieves state-of-the-art performance on 13 out of 17 datasets. Furthermore, the experiments on unseen NLG and NLU tasks demonstrate that our supervised MVP model has a strong generalization ability for unseen tasks. For reproducing and reusing our work, we release the collection MVPCorpus, the models (e.g., MVP, task-specific prompts, and multi-task variants), and the code for pre-training and fine-tuning at the link: https://anonymous.4open.science/r/ICLR-2023-Paper3518/.

2. RELATED WORK

Pre-trained Language Models. Pre-trained language models have achieved exceptional success in a wide range of tasks, and the majority of them are pre-trained in an unsupervised manner (Brown et al., 2020; Devlin et al., 2019; Lewis et al., 2020; Raffel et al., 2020). For example, with large-scale plain texts as the unsupervised pre-training corpus, GPT-3 (Brown et al., 2020) employs language modeling as the pre-training task, i.e., predicting the next token conditioned on previous tokens, while BART (Lewis et al., 2020) learns to recover the original text from corrupted text that has been altered by arbitrary noising transformations. GPT-3 and BART use 570GB and 160GB of unlabeled text as their pre-training corpora, respectively. In the meantime, the computer vision community has benefited greatly from the labeled dataset ImageNet (Deng et al., 2009): influential models such as ResNet (He et al., 2016) and ViT (Dosovitskiy et al., 2021) leverage ImageNet for pre-training. Inspired by the success of pre-training with labeled data, machine translation researchers have explored supervised pre-training (McCann et al., 2017; Lin et al., 2020b). Lin et al. (2020b) pre-train a translation model, mRASP, with parallel data in multiple languages; despite using much less pre-training data, mRASP still achieves better performance than translation models pre-trained in an unsupervised manner (Lample & Conneau, 2019; Liu et al., 2020). In this paper, we propose to pre-train a universal NLG model in a supervised manner with collections of labeled datasets (23GB).

Multi-task Learning. Our pre-training process is also related to multi-task learning (MTL), a method of mixing multiple tasks into a single training process (Collobert & Weston, 2008). A model trained with MTL can benefit from helpful knowledge of relevant tasks, resulting in improved performance (McCann et al., 2018; Subramanian et al., 2018).
Recently, MT-DNN (Liu et al., 2019a) and Muppet (Aghajanyan et al., 2021) collect tens of datasets for the multi-task procedure and achieve better performance on downstream tasks; the pre-finetuning schema proposed in Muppet shares a similar idea with our study. Aribandi et al. (2022) further combine the denoising pre-training task of T5 (Raffel et al., 2020) with multi-task learning to pre-train a new model, ExT5. MTL has also contributed to sub-fields of text generation, such as open-ended dialogue systems (Zhang et al., 2020), task-oriented dialogue systems (Su et al., 2022), text style transfer (Bujnowski et al., 2020), and question answering (Khashabi et al., 2020). At the same time, researchers have explored the transferability of models trained on multi-task datasets (Mishra et al., 2022). FLAN (Wei et al., 2022), T0 (Sanh et al., 2022), and ZeroPrompt (Xu et al., 2022) investigate the zero-shot generalization abilities of large PLMs trained on numerous task datasets with well-designed prompts, and Ye et al. (2021) develop the benchmark CrossFit to study the few-shot learning ability of models. Compared with these works, we aim to explore multi-task learning to derive both effective and general NLG models.

Prompt Learning. Prompt learning is a thriving method in the field of NLP. It converts fine-tuning inputs into a format similar to pre-training in order to leverage implicit pre-training knowledge and alleviate the discrepancy between pre-training and fine-tuning (Liu et al., 2021b). GPT-2 (Radford et al., 2019) and T5 (Raffel et al., 2020) add human-written task prompts to the input text; for instance, T5 prepends "Summarize:" to the input document for summarization tasks. GPT-3 (Brown et al., 2020) further combines several demonstrations with the input to learn task patterns, which is called in-context learning.
Some researchers also design elaborate prompts or demonstrations for each task and dataset and investigate their effectiveness and robustness (Wei et al., 2022; Sanh et al., 2022; Xu et al., 2022; Mishra et al., 2022) . To overcome the constraints of manually constructed prompts, researchers develop continuous (soft) prompts that can be optimized in the continuous space (Lester et al., 2021; Qin & Eisner, 2021) . Prefix-tuning (Li & Liang, 2021) increases the number of parameters in prompts and employs prompting in each Transformer layer. Gu et al. (2022) propose PPT to pre-train continuous prompts using unlabeled data. SPoT (Vu et al., 2022) and UnifiedSKG (Xie et al., 2022) learn the prompts on related tasks and transfer them to new tasks.

3. THE MVP MODEL

This section introduces our MVP model: a Multi-task superVised Pre-trained model for natural language generation. We first collect a large-scale NLG corpus, MVPCorpus, from 77 datasets over 11 diverse NLG tasks. After that, we pre-train our MVP model using a mixture of labeled data from MVPCorpus. We further learn the task-specific prompts to stimulate the MVP model to perform a certain task. The overview of our model is illustrated in Figure 1 .

3.1. DATA COLLECTION

Formally, a natural language generation (NLG) task aims to generate a sequence of tokens Y = (y_1, y_2, ..., y_n) conditioned on input data X (e.g., a piece of text or structured data) (Li et al., 2022). Typically, NLG tasks are categorized according to the data format of X and Y. For example, text summarization condenses a long document into a brief text containing essential information; data-to-text generation produces descriptive text about structured input; and a dialogue system creates pertinent responses given multiple dialogue utterances. In this paper, we collect a large-scale labeled MVPCorpus consisting of 77 labeled datasets from 11 representative NLG tasks. For evaluation, we utilize the remaining 27 datasets, which are more commonly used in the literature. Among these, 23 datasets are from the 7 tasks used in pre-training; we refer to them as seen tasks and use them to test the effectiveness of our model. The remaining 4 datasets are from the tasks of commonsense generation, paraphrase generation, simplification, and style transfer, respectively; we call these unseen tasks and use them to examine the generalization ability of our model.
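To make the unified text-to-text format concrete, the sketch below linearizes labeled examples from different tasks into (input, output) string pairs. This is our own minimal reconstruction, not the released code: "Summarize:" is the prompt stated in the paper, while the other prompt wordings and the helper names are assumptions for illustration (the actual wordings are in Appendix E).

```python
# Minimal sketch (not the authors' released code) of casting labeled NLG
# examples into a unified text-to-text format. "Summarize:" follows the
# paper; the other prompt wordings are assumed for illustration.

TASK_PROMPTS = {
    "summarization": "Summarize:",
    "data_to_text": "Describe the following data:",   # assumed wording
    "question_generation": "Generate a question:",    # assumed wording
}

def to_text_to_text(task: str, source: str, target: str):
    """Linearize one labeled example into an (input, output) string pair."""
    return f"{TASK_PROMPTS[task]} {source}", target

src, tgt = to_text_to_text(
    "summarization",
    "The city council met on Tuesday to discuss the new transit plan.",
    "Council discusses transit plan.",
)
```

Once every dataset is cast this way, a single sequence-to-sequence objective can be applied uniformly across all 11 tasks.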

3.2. MODEL ARCHITECTURE

We pre-train our MVP model and task-specific prompts in two stages. In the first stage, we pre-train the MVP backbone using a mixture of labeled datasets from seven tasks to learn general text-to-text relationships and transferable semantic information across tasks. To indicate each task, we apply human-written prompts to each task instance; for example, we write "Summarize:" as the prompt for summarization tasks. The manual prompts for each task are shown in Appendix E. In the second stage, we freeze the MVP backbone and pre-train a set of task-specific soft prompts (i.e., continuous vectors) to stimulate the model's capacity to perform specific tasks. We learn them using a mixture of corresponding intra-task datasets (i.e., datasets under the same task). These soft prompts, which are not shared between tasks, encode task-specific semantic knowledge to alleviate the blurring-out problem induced by multi-task learning (He & Choi, 2021).

Specifically, we employ the standard Transformer encoder-decoder (Vaswani et al., 2017) as our backbone. Compared to decoder-only architectures such as GPT-3 (Brown et al., 2020) and prefix LMs such as UniLM (Dong et al., 2019), the encoder-decoder architecture is more effective for text generation tasks (Raffel et al., 2020). As for task-specific soft prompts, we insert continuous vectors at each Transformer layer, following prefix-tuning (Li & Liang, 2021). Compared to prompt tuning (Lester et al., 2021), which only adds trainable embeddings to the input layer, the layer-wise prompting of prefix-tuning is more effective and stable (Liu et al., 2022), especially for NLG tasks.
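To illustrate what layer-wise prompting does mechanically, the NumPy sketch below mimics how prefix-tuning-style vectors could be prepended to the keys and values of one attention layer. The dimensions (prompt length 100, hidden size 1,024, MLP reparameterization width 800) follow Section 3.3, but the code itself is an illustrative reconstruction, not the authors' implementation, and the weights are random stand-ins for trained parameters.

```python
import numpy as np

# Illustrative sketch of layer-wise prompting in the style of prefix-tuning:
# trainable vectors are prepended to the keys and values of an attention
# layer, so attention can attend to the prefix positions. Weights here are
# random stand-ins for trained parameters.

rng = np.random.default_rng(0)
PROMPT_LEN, HIDDEN, REPARAM = 100, 1024, 800

embed = rng.standard_normal((PROMPT_LEN, REPARAM))   # prefix embedding
W1 = rng.standard_normal((REPARAM, REPARAM))         # MLP reparameterization
W2 = rng.standard_normal((REPARAM, 2 * HIDDEN))      # maps to key and value halves

def prepend_prefix(keys, values):
    """Prepend reparameterized prefix vectors to one layer's keys/values."""
    prefix = np.tanh(embed @ W1) @ W2                # (PROMPT_LEN, 2*HIDDEN)
    pk, pv = prefix[:, :HIDDEN], prefix[:, HIDDEN:]
    batch = keys.shape[0]
    pk = np.broadcast_to(pk, (batch, PROMPT_LEN, HIDDEN))
    pv = np.broadcast_to(pv, (batch, PROMPT_LEN, HIDDEN))
    return (np.concatenate([pk, keys], axis=1),
            np.concatenate([pv, values], axis=1))

k = v = rng.standard_normal((2, 50, HIDDEN))         # batch of 2, length 50
k_ext, v_ext = prepend_prefix(k, v)                  # attention now sees 150 positions
```

Because only the prefix parameters differ per task, swapping task-specific prompts changes the model's behavior without touching the frozen backbone.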

3.3. TRAINING DETAILS

Our MVP model adopts a Transformer with 12 layers in both the encoder and the decoder (406M parameters), the same model size as BART LARGE (Lewis et al., 2020). The hidden size is 1,024, and the inner hidden size of the feed-forward network is 4,096. We employ the byte-pair-encoding (BPE) tokenizer with a vocabulary size of 50,267. We initialize the backbone with the BART parameters to provide a good starting point for NLG tasks, following previous work (Dong et al., 2019; Zhang et al., 2020). We pre-train the model with a batch size of 8,192 and adopt a temperature-scaled mixing strategy (Raffel et al., 2020) with a rate of T = 2 to mitigate the disparity in tasks and datasets.

We follow prefix-tuning (Li & Liang, 2021) to pre-train task-specific prompts by prepending trainable continuous vectors to the keys and values of the multi-head attention module at each layer. The prompt length is set to 100, and we utilize an MLP reparameterization function with a hidden size of 800 to improve training robustness and performance (Li & Liang, 2021). Hence, each group of task-specific prompts has approximately 62M parameters. We then freeze the MVP model and train seven groups of task-specific prompts, each corresponding to a different task, with a batch size of 8,192 and the same mixing strategy with a rate of T = 2.

In both stages, the maximum length of input and output sequences is set to 1,024 to support examples with more tokens. We optimize the model with a constant learning rate of 3 × 10^-5 using the standard sequence-to-sequence cross-entropy loss. We apply the AdamW optimizer (Loshchilov & Hutter, 2019) with β1 = 0.9, β2 = 0.98, and ε = 1 × 10^-6 to improve training stability (Liu et al., 2019b); the weight decay coefficient is 0.1. For testing, we select the checkpoint with the highest validation performance. All experiments are conducted on 32 NVIDIA Tesla V100 32GB GPUs.
We implement our model using the Hugging Face Transformers library (Wolf et al., 2020). In summary, we pre-train a 406M-parameter text generation model MVP and seven groups of 62M-parameter task-specific prompts. For each downstream task, users can either utilize the MVP backbone (406M) directly or further combine MVP with task-specific prompts (468M).
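The temperature-scaled mixing strategy mentioned above can be sketched as follows: each dataset is sampled with probability proportional to its size raised to 1/T, which softens the disparity between large and small datasets. This is the basic form from Raffel et al. (2020); the dataset sizes below are hypothetical, and we omit T5's additional cap on dataset size.

```python
# Sketch of temperature-scaled mixing with rate T = 2: sampling
# probabilities are proportional to dataset size raised to 1/T.
# Dataset sizes are hypothetical.

def mixing_rates(sizes, T=2.0):
    scaled = [n ** (1.0 / T) for n in sizes]
    total = sum(scaled)
    return [s / total for s in scaled]

sizes = [1_000_000, 100_000, 10_000]
rates = mixing_rates(sizes, T=2.0)
# With T = 2 the largest dataset is sampled only 10x (not 100x) as often
# as the smallest, since (1_000_000 / 10_000) ** 0.5 == 10.
```

Setting T = 1 recovers proportional sampling, while larger T moves the mixture toward uniform sampling over datasets.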

4. EXPERIMENT RESULTS

In this section, we mainly investigate the effectiveness of our proposed supervised pre-training for NLG. Specifically, we fine-tune our MVP model on new datasets for pre-trained (seen) generation tasks under full tuning and parameter-efficient tuning settings. In the full tuning setting, we fine-tune the entire model (including the MVP backbone and prompts), while in parameter-efficient tuning, we only fine-tune the prompts and freeze the parameter weights of MVP. We optimize the model via the seq2seq loss with a label smoothing (Szegedy et al., 2016) factor of 0.1 and the AdamW optimizer with default hyper-parameters. We sweep over the batch size in {16, 64, 256} and the learning rate in {5 × 10^-6, 1 × 10^-5, 3 × 10^-5} to find the optimal hyper-parameters for each evaluation task. We utilize the checkpoint with the best validation performance for test set inference. During inference, we set the beam size to 5 and the no-repeat n-gram size to 3. For evaluation, we leverage the automatic generation metrics BLEU (Papineni et al., 2002), ROUGE (Lin, 2004), and METEOR (Banerjee & Lavie, 2005) to measure the quality of the generated text and employ Distinct (Li et al., 2016) to evaluate its diversity. Details regarding fine-tuning and evaluation can be found in Appendix C.

We conduct extensive experiments in different settings. Under full tuning scenarios, we employ the 23 datasets from the 7 seen tasks for evaluation; Section 4.1 and Appendix D analyze the performance of our methods on these datasets. To better compare with ExT5 (Aribandi et al., 2022), we conduct experiments on the GEM benchmark (Gehrmann et al., 2021) in Appendix D.2. Under parameter-efficient tuning settings, we utilize the same datasets as in Section 4.1, and the results can be found in Section 4.2. Furthermore, we evaluate our models without fine-tuning and compare them with T0 (Sanh et al., 2022) in Appendix D.3.
These extensive results show that our MVP model consistently outperforms various baselines in different scenarios, which demonstrates the effectiveness of our supervised pre-training method for NLG.
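As an example of the diversity metric used in our evaluation, a minimal sentence-level implementation of Distinct-n (Li et al., 2016), the ratio of unique n-grams to total n-grams in generated text, might look as follows. This is our own sketch; released evaluation code may differ, e.g., by aggregating at the corpus level.

```python
# Minimal sentence-level Distinct-n: unique n-grams / total n-grams.

def distinct_n(tokens, n):
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    if not ngrams:
        return 0.0
    return len(set(ngrams)) / len(ngrams)

tokens = "i am fine i am fine thanks".split()
d1 = distinct_n(tokens, 1)   # 4 unique unigrams out of 7
d2 = distinct_n(tokens, 2)   # 4 unique bigrams out of 6
```

Higher values indicate more diverse output; heavily repetitive generations drive the ratio toward zero.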

4.1. FULL TUNING PERFORMANCE

We design several model variants to verify the effectiveness of the two-stage pre-training method proposed in Section 3.2. For the first-stage model MVP, which uses multi-task supervised pre-training, we compare it with two competitive backbones using different pre-training strategies:

• BART LARGE (Lewis et al., 2020): BART is a widely used PLM for natural language generation trained with an unsupervised pre-training task, i.e., denoising autoencoding.

• Single-task pre-training (Single): We individually train a single model for each task using intra-task datasets under the same pre-training settings as in multi-task training. For instance, we pre-train a summarization model using summarization datasets (e.g., Newsroom, WikiHow, and MSNews). Therefore, we have seven single-task pre-trained models in total.

For the second-stage model that integrates pre-trained task-specific prompts (denoted by MVP+S), we compare it with two variants using different prompts:

• Randomly initialized prompts (MVP+R): The layer-wise prompts for the MVP model are randomly initialized without pre-training.

• Multi-task pre-trained prompts (MVP+M): We pre-train only one group of prompts for all tasks, using the same mixed datasets as in the backbone pre-training.

Besides these variants, we further include the best reported results from the original papers in the literature for comparison (denoted as SOTA). From the results in Table 2, we can see that:

First, supervised pre-training models (i.e., MVP and Single) achieve better performance than the unsupervised pre-trained model BART, yielding average improvements of 7.0% and 4.4% (in ratio), respectively. This finding demonstrates the effectiveness of our supervised pre-training method. With labeled datasets, supervised pre-training enables the model to acquire more task-specific information, thus leading to improved results on downstream tasks. Regarding multi-task pre-training (MVP) versus single-task pre-training (Single), our MVP model outperforms its single-task counterparts by 2.7%. This result indicates that the proposed multi-task learning approach can enhance single-task performance by learning transferable semantic information across tasks.

Second, task-specific prompt learning is effective in alleviating the "blurring-out" issue of multi-task learning. For tasks such as data-to-text generation and question answering, MVP with single-task prompt pre-training (MVP+S) consistently outperforms the other two variants (MVP+R and MVP+M). This verifies that task-specific prompts can acquire specialized knowledge of each task and stimulate the capacity of the MVP model to perform certain tasks.

Finally, our supervised pre-training approach achieves five new SOTA results on data-to-text generation, question generation, question answering, story generation, and open-ended dialogue tasks in Table 2. We also achieve SOTA performance on six out of eight datasets in Table 9, which shows the strong text generation capability of our MVP model. As for the remaining tasks, the SOTA models incorporate techniques tailored to the tasks, e.g., a re-ranking framework (Ravaut et al., 2022) and various task-specific objectives (He et al., 2022), which yield better performance than our models. In contrast, the results of our models are very competitive while being obtained with a general architecture and a unified learning objective.

Table 3: The main results on seven seen tasks under parameter-efficient settings. We also include the results of BART and MVP under the full tuning setting (denoted as FT) for comparison.

4.2. PARAMETER-EFFICIENT TUNING PERFORMANCE

In the lightweight (parameter-efficient) fine-tuning setting, we only tune the prompts while freezing the backbone MVP model. Besides our MVP+S model, we compare the following methods:

• Prefix-tuning (Li & Liang, 2021): Prefix-tuning is a popular prompt-based lightweight tuning method for text generation. We employ BART LARGE as its backbone, denoted as BART+R.

• Only tuning randomly initialized prompts (MVP+R): This variant only tunes the randomly initialized prompts of MVP+R and shares a similar idea with prefix-tuning.

• Only tuning multi-task pre-trained prompts (MVP+M): This variant only tunes the multi-task pre-trained prompts of MVP+M. Such an idea has been used in SPoT (Vu et al., 2022).

From the experimental results in Table 3, we can see that the good performance of the MVP model in lightweight settings further demonstrates the effectiveness of supervised pre-training. Comparing the two randomly initialized prompting methods (BART+R and MVP+R), MVP+R achieves superior performance to BART+R (+2.0%) due to its multi-task supervised backbone. Furthermore, when initialized with pre-trained prompts, MVP+S and MVP+M achieve improved results over MVP+R, which is consistent with the findings of SPoT (Vu et al., 2022). When compared with MVP+M, MVP+S performs marginally better (by 1.2%), indicating that task-specific prompts are useful for improving the model on specific generation tasks. Surprisingly, our lightweight MVP+S can even outperform fully tuned BART on tasks such as question generation and question answering, showcasing the effectiveness of the proposed supervised pre-training approach. Another observation is that lightweight prompting methods (Lester et al., 2021; Vu et al., 2022) that work well on NLU tasks cannot achieve competitive performance compared to full tuning methods on NLG tasks.
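The parameter-efficient setting above (tune the prompts, freeze MVP) can be sketched in PyTorch as follows. The toy modules and the "prefix" naming are hypothetical stand-ins for the real backbone and prompt parameters, not the actual MVP architecture.

```python
import torch.nn as nn

# Toy sketch of parameter-efficient tuning: freeze all backbone weights and
# leave only the prompt ("prefix") parameters trainable. The modules below
# are stand-ins, not the actual MVP architecture.

model = nn.ModuleDict({
    "backbone": nn.Linear(8, 8),        # stands in for the frozen MVP model
    "prefix": nn.Embedding(100, 8),     # stands in for task-specific prompts
})

for name, param in model.named_parameters():
    param.requires_grad = name.startswith("prefix")

trainable = sorted(n for n, p in model.named_parameters() if p.requires_grad)
# only the prompt parameters remain trainable
```

An optimizer built over `(p for p in model.parameters() if p.requires_grad)` would then update only the prompt parameters, which is what makes this setting lightweight.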

5. GENERALIZATION ABILITY

In this section, we test our MVP model on unseen NLG and NLU tasks to verify its generalizability.

Generalization to Unseen NLG Tasks. According to Deng et al. (2021), an NLG task can be assigned to one of three categories: compression (e.g., summarization), transduction (e.g., translation), or creation (e.g., story generation). Since we do not include any transduction tasks during pre-training, we evaluate our MVP model on two unseen transduction NLG tasks: paraphrase generation and text style transfer. We select the SOTA methods for these two tasks, i.e., AESOP (Sun et al., 2021) for paraphrase generation and SC & BLEU (Lai et al., 2021) for text style transfer, and replace their backbone BART with our MVP model for comparison. The experimental setup remains the same as described in Section 4, and details are reported in Appendix C. From the results in Table 4, we can see that our model outperforms BART by a ratio of 2.2% and achieves two new SOTA results, which verifies the strong generalizability of our model. This finding shows that our MVP model is more capable than BART and can serve as a general yet effective backbone for more specific tasks by providing superior parameter initialization.

Generalization to Unseen NLU Tasks. Although MVP is designed especially for NLG tasks, we also evaluate its performance on unseen NLU tasks using the widely used GLUE benchmark (Wang et al., 2019). We compare our model to BART LARGE using its original sequence classification method (Lewis et al., 2020). The detailed settings can be found in Appendix C. According to the results presented in Table 5, our MVP model outperforms BART on 9 of 12 metrics, with an overall improvement of 0.71%. This result indicates the strong generalization ability of our MVP model and further demonstrates that our supervised pre-training not only learns generation ability but also improves the overall semantic representations.

6. DISCUSSION

Differences with Existing Methods. To the best of our knowledge, existing supervised pre-training works mainly focus on NLU tasks (Aghajanyan et al., 2021; Aribandi et al., 2022) or a small number of NLG tasks (Lin et al., 2020b; Su et al., 2022). Given the superior performance achieved by supervised pre-training approaches, it is important to explore supervised pre-training for deriving both effective and general NLG models. Our work makes a significant contribution in this direction, achieving SOTA performance with a single model on 13 of 17 datasets. Compared with its strong counterpart ExT5 (Aribandi et al., 2022), our MVP model outperforms it on 26 out of 27 metrics (detailed in Appendix D.2). To better illustrate the differences between our paper and previous supervised (multi-task) pre-training studies, we present a detailed comparison in Table 6. As shown there, our work covers the largest number of NLG tasks for both supervised pre-training and fine-tuning, incorporates task-specific prompts, and releases all the important resources for reproducing or reusing our work.

Applicability. To facilitate the application of our work, we have released the collected corpus, the pre-trained models, task-specific prompts, and the generated texts. Our MVPCorpus is the largest NLG task collection to date. One can use all the data to pre-train a general model or select a subset to continue pre-training a domain- or task-specific model (Gururangan et al., 2020). MVPCorpus can also serve as an evaluation benchmark for different NLG tasks. Furthermore, our MVP model can be used to achieve new state-of-the-art results on various NLG tasks: users can either fine-tune the MVP model or integrate it with task-specific prompts to achieve better results given sufficient labeled data. Even in data-scarce domains, our MVP model can also be directly employed to obtain good performance without fine-tuning. In addition, our MVP model can provide effective parameter initialization for improving existing methods, as described in Section 5. Finally, the pre-trained task-specific prompts and the generated texts can be further used to study task similarity and its effect on multi-task pre-training.

7. CONCLUSION

In this paper, we present Multi-task superVised Pre-training (MVP) for natural language generation. First, we collect a large-scale NLG corpus, MVPCorpus, from 77 datasets over 11 diverse NLG tasks. After converting various NLG tasks into a unified text-to-text format, we propose multi-task supervised pre-training to learn an effective and general model, MVP, with task-specific prompts for NLG tasks. Extensive experiments have demonstrated that: (1) supervised pre-training is beneficial for NLG tasks as a general solution; our MVP model outperforms the unsupervised pre-trained model BART and even achieves SOTA performance on 13 out of 17 datasets; (2) supervised pre-trained models have strong generalization ability on unseen generation and even understanding tasks. In future work, we will explore a multilingual version of our MVP model by covering more datasets in other languages. Such a model is expected to capture language-independent task characteristics and improve generation tasks in minority languages. Besides, it is interesting to study how different tasks relate to each other in the unified semantic space of the MVP model, which could inspire methods that incorporate task relations as priors.

BROADER IMPACTS

In this paper, we pre-train a language model, MVP, using labeled NLG datasets. According to prior research (Bender et al., 2021; Bommasani et al., 2021), PLMs tend to "remember" what they have "seen" in the pre-training corpus. This could result in the reproduction of undesirable biases from pre-training data on downstream tasks. Training data intervention could be a solution to alleviate this issue (Lu et al., 2020). It would also be interesting to investigate in the future whether supervised pre-training produces fewer biases than unsupervised pre-training.

Environmental impact is another factor we should consider. We have attempted a more efficient pre-training strategy and released our PLM for future work. In contrast to large PLMs with tens of billions of parameters, such as T5 (Raffel et al., 2020) and GPT-3 (Brown et al., 2020), we pre-train only a small model with hundreds of millions of parameters. In addition, we utilize supervised pre-training data and initialize our model with pre-trained BART, both of which accelerate the convergence of our model. Ultimately, our model is pre-trained for about 20,000 steps, whereas BART of the same size is pre-trained for 500,000 steps.

REPRODUCIBILITY

For reproducing and reusing our work, we have released the collection MVPCorpus, the models (e.g., MVP, task-specific prompts, and multi-task variants), intermediate results (e.g., the generated texts), and source code for pre-training and fine-tuning at the link: https://anonymous.4open.science/r/ICLR-2023-Paper3518/. The detailed settings of the experiments are listed in Appendix C. We hope that these open-source resources will facilitate future work on supervised pre-training and contribute to the advancement of NLG research.

For the experiments on the GEM benchmark in Appendix D.2 (Table 10), the fine-tuning settings are the same as those described in Section 4. We use BLEU-4, ROUGE-2, and METEOR for evaluation, computed with the GEM evaluation scripts.

D ADDITIONAL RESULTS

In this section, we provide additional results of our MVP model and other baselines.

D.1 RESULTS OF COMMON DATASETS

We also conduct experiments on eight common datasets under full tuning settings. Due to space limits in Section 4, these results are shown in Table 9. They share a similar trend with those in Section 4, and we achieve SOTA performance on 6 out of 8 datasets.

D.2 RESULTS ON THE GEM BENCHMARK

To better compare with ExT5 (Aribandi et al., 2022) , we conduct experiments on the GEM benchmark (Gehrmann et al., 2021) . For "unseen" commonsense generation and text simplification tasks, we utilize prompts of data-to-text generation and summarization, respectively. The results are presented in Table 10 , and our MVP models outperform ExT5 in 26 out of 27 metrics.

D.3 RESULTS WITHOUT FINE-TUNING

Considering that our MVP model has already been pre-trained on several tasks, we conduct experiments on these "seen" tasks without fine-tuning our model. To some degree, this setting can be viewed as zero-shot learning. Nonetheless, it does not conform to the definition of true zero-shot settings (Perez et al., 2021). To avoid controversy, we refer to this setting as "without fine-tuning". We include T0-3B (Sanh et al., 2022) as our baseline. The results are listed in Table 11. Our MVP model outperforms T0 on all metrics by a large margin. However, on all tasks, methods without fine-tuning perform significantly worse than those under full tuning settings. This suggests that zero-shot strategies that are effective for NLU tasks may not produce satisfactory results for NLG tasks. Even though our model has acquired task knowledge, it struggles to perform well in a new domain without being fine-tuned. Thus, we focus mainly on full tuning settings in this paper.

Table 10: The results on the GEM benchmark under full tuning settings. We utilize the large version of T5.1.1 and ExT5, and all of their results are from Aribandi et al. (2022).

Methods  DART (B-4 / R-2 / ME)    E2E (B-4 / R-2 / ME)     ToTTo (B-4 / R-2 / ME)
T5.1.1   34.31 / 45.22 / 36.30    42.57 / 46.60 / 38.20    39.79 / 49.90 / 36.80
ExT5     36.62 / 48.14 / 37.60    42.25 / 46.70 / 38.10    40.14 / 50.33 / 36.90
MVP      39.13 / 48.92 / 38.53    37.38 / 47.96 / 39.39    50.58 / 55.24 / 41.27
MVP+S    38.83 / 48.49 / 38.41    37.32 / 47.40 / 38.90    50.69 / 55.52

Table 11: The results on seven seen tasks without fine-tuning. Given that T0 has been pre-trained on the CNN/DailyMail dataset, we exclude its results to provide a fair comparison (denoted as "-").

(Continuation of the input article for the second CNN/Daily Mail instance shown below:) "Governments seeking to penalize Palestine for joining the ICC should immediately end their pressure, and countries that support universal acceptance of the court's treaty should speak out to welcome its membership," said Balkees Jarrah, international justice counsel for the group. "What's objectionable is the attempts to undermine international justice, not Palestine's decision to join a treaty to which over 100 countries around the world are members." In January, when the preliminary ICC examination was opened, Israeli Prime Minister Benjamin Netanyahu described it as an outrage, saying the court was overstepping its boundaries. The United States also said it "strongly" disagreed with the court's decision. "As we have said repeatedly, we do not believe that Palestine is a state and therefore we do not believe that it is eligible to join the ICC," the State Department said in a statement. It urged the warring sides to resolve their differences through direct negotiations. "We will continue to oppose actions against Israel at the ICC as counterproductive to the cause of peace," it said. But the ICC begs to differ with the definition of a state for its purposes and refers to the territories as "Palestine." While a preliminary examination is not a formal investigation, it allows the court to review evidence and determine whether to investigate suspects on both sides. Prosecutor Fatou Bensouda said her office would "conduct its analysis in full independence and impartiality." The war between Israel and Hamas militants in Gaza last summer left more than 2,000 people dead. The inquiry will include alleged war crimes committed since June.

Input

Input: Describe the following data: Abilene, Texas - cityServed - Abilene Regional Airport
Gold: Abilene, Texas is served by the Abilene regional airport. / Abilene Regional Airport serves the city of Abilene in Texas.
BART: Abilene Regional Airport serves the city of Abilene in Texas.
MVP: Abilene Regional Airport serves the city of Abilene, Texas.
MVP+S: Abilene Regional Airport serves the city of Abilene, Texas.

Table 15: The second instance from the WebNLG dataset, which has three gold target sentences.

Input: Describe the following data: "Madrid, Paracuellos de Jarama, San Sebastián de los Reyes and Alcobendas" - location - Adolfo Suárez Madrid-Barajas Airport
Gold: Adolfo Suárez Madrid-Barajas Airport can be found in Madrid, Paracuellos de Jarama, San Sebastián de los Reyes and Alcobendas. / Adolfo Suarez Madrid-Barajas airport is located at Madrid, Paracuellos de Jarama, San Sebastián de los Reyes and Alcobendas. / Adolfo Suarez Madrid-Barajas Airport is located in Madrid, Paracuellos de Jarama, San Sebastian de los Reyes and Alcobendas.

BART

Adolfo Suárez Madrid-Barajas Airport can be found in Madrid, Paracuellos de Jarama, San Sebastián de los Reyes and Alcobendas.

MVP

Adolfo Suárez Madrid-Barajas Airport can be found in Madrid, Paracuellos de Jarama, San Sebastián de los Reyes and Alcobendas.

MVP+S

Adolfo Suárez Madrid-Barajas Airport is located in Madrid, Paracuellos de Jarama, San Sebastián de los Reyes and Alcobendas.

Table 18: The first instance from the CoQA dataset.

Input: Answer the following question: what color was cotton ? [X SEP] once upon a time , in a barn near a farm house , there lived a little white kitten named cotton . cotton lived high up in a nice warm place above the barn where all of the farmer ' s horses slept . but cotton wasn ' t alone in her little home above the barn , oh no . she shared her hay bed with her mommy and 5 other sisters . all of her sisters were cute and fluffy , like cotton . but she was the only white one in the bunch . the rest of her sisters were all orange with beautiful white tiger stripes like cotton ' s mommy . being different made cotton quite sad . she often wished she looked like the rest of her family . so one day , when cotton found a can of the old farmer ' s orange paint , she used it to paint herself like them . when her mommy and sisters found her they started laughing . " what are you doing , cotton ? ! " " i only wanted to be more like you " . cotton ' s mommy rubbed her face on cotton ' s and said " oh cotton , but your fur is so pretty and special , like you . we would never want you to be any other way " . and with that , cotton ' s mommy picked her up and dropped her into a big bucket of water . when cotton came out she was herself again . her sisters licked her face until cotton ' s fur was all all dry . " don ' t ever do that again , cotton ! " they all cried . " next time you might mess up that pretty white fur of yours and we wouldn ' t want that ! " then cotton thought , " i change my mind . i like being special " .

Gold: white
BART: white
MVP: white
MVP+S: white

Table 19: The second instance from the CoQA dataset.

Input: Answer the following question: what color was cotton ? [SEP] white [X SEP] where did she live ? [X SEP] once upon a time , in a barn near a farm house , there lived a little white kitten named cotton . cotton lived high up in a nice warm place above the barn where all of the farmer ' s horses slept . but cotton wasn ' t alone in her little home above the barn , oh no . she shared her hay bed with her mommy and 5 other sisters . all of her sisters were cute and fluffy , like cotton . but she was the only white one in the bunch . the rest of her sisters were all orange with beautiful white tiger stripes like cotton ' s mommy . being different made cotton quite sad . she often wished she looked like the rest of her family . so one day , when cotton found a can of the old farmer ' s orange paint , she used it to paint herself like them . when her mommy and sisters found her they started laughing . " what are you doing , cotton ? ! " " i only wanted to be more like you " . cotton ' s mommy rubbed her face on cotton ' s and said " oh cotton , but your fur is so pretty and special , like you . we would never want you to be any other way " . and with that , cotton ' s mommy picked her up and dropped her into a big bucket of water . when cotton came out she was herself again . her sisters licked her face until cotton ' s fur was all all dry . " don ' t ever do that again , cotton ! " they all cried . " next time you might mess up that pretty white fur of yours and we wouldn ' t want that ! " then cotton thought , " i change my mind . i like being special " .



We do not consider machine translation tasks, focusing only on English tasks in this work.
For instance, we train summarization-specific prompts using summarization datasets (e.g., Newsroom (Grusky et al., 2018), WikiHow (Koupaee & Wang, 2018), and MSNews (Liu et al., 2021a)).
https://www.kaggle.com/c/quora-question-pairs
https://github.com/GEM-benchmark/GEM-metrics



Figure 1: The overview of the pre-training process of our MVP model and task-specific prompts.

The main results on seven seen tasks under full tuning settings. The best and second-best results among all the methods are marked in bold and underlined, respectively. The SQuAD dataset here is used for the question generation task. The letters B, R, D, and ME denote BLEU, ROUGE, Distinct, and METEOR, respectively. "-" means the work does not compute the corresponding result. These setups and abbreviations are the same below. a (Ravaut et al., 2022) b (Ke et al., 2021) c (Bao et al., 2021) d (Xiao et al., 2020) e (Lewis et al., 2020) f (Liu et al., 2021a) g (Guan et al., 2021) h (Chen et al., 2022) i (He et al., 2022) j(Lin et al., 2020c)

The main results of unseen NLG tasks. We use AESOP and SC & BLEU to denote the methods proposed by Sun et al. (2021) and Lai et al. (2021), respectively. Accuracy is calculated by a pre-trained TextCNN to evaluate the style strength, and HM denotes the harmonic mean of BLEU-4 and style accuracy. a (Sun et al., 2021) b (Lai et al., 2021)

The main results of NLU tasks on the GLUE benchmark. We evaluate the results on the official website https://gluebenchmark.com/. Matt. means the Matthews correlation coefficient. Acc. stands for the accuracy rate. P/S Corr. denote Pearson and Spearman correlation coefficients. m./mm. refer to the accuracy of the matched and mismatched domains. Avg. is a macro-average of scores defined in Wang et al. (2019).

Comparison of our work with existing supervised pre-training methods. #NLG/#NLU denote the number of NLG and NLU tasks, respectively. PT denotes pre-training, FT denotes fine-tuning and SP denotes supervised pre-training.

The statistics and licenses of datasets for pre-training our MVP model. The #Train, #Valid, and #Test denote the number of examples in the train, valid, and test sets, respectively. Cleaned #Train represents the number of training examples after filtering. Input and Output are the average number of words (split by space) in the input and output sequences, respectively. These setups and abbreviations are the same below.

The statistics and licenses of datasets for evaluating our MVP model. The license of the MNLI dataset is composed of OANC, CC BY-SA 3.0, and CC BY 3.0. The license of the CoQA dataset is composed of CC BY-SA 4.0, MSR-LA, and Apache 2.0. The license of the WiA-A/T datasets is composed of CC BY-NC 3.0, CC BY-NC 4.0, and GNU General Public License v3.0.

The results on six seen tasks under full tuning settings. a (Nguyen et al., 2021) b (Tang et al., 2022) c (Gu et al., 2021) d (Lewis et al., 2020) e (Guan et al., 2021) f (Chen et al., 2022) g (Chen et al., 2020b) h (Raffel et al., 2020) i (Xu et al., 2021)


Table 13: The second instance from the CNN/Daily Mail dataset.

Input: Summarize: The Palestinian Authority officially became the 123rd member of the International Criminal Court on Wednesday, a step that gives the court jurisdiction over alleged crimes in Palestinian territories. The formal accession was marked with a ceremony at The Hague, in the Netherlands, where the court is based. The Palestinians signed the ICC's founding Rome Statute in January, when they also accepted its jurisdiction over alleged crimes committed "in the occupied Palestinian territory, including East Jerusalem, since June 13, 2014." Later that month, the ICC opened a preliminary examination into the situation in Palestinian territories, paving the way for possible war crimes investigations against Israelis. As members of the court, Palestinians may be subject to counter-charges as well. Israel and the United States, neither of which is an ICC member, opposed the Palestinians' efforts to join the body. But Palestinian Foreign Minister Riad al-Malki, speaking at Wednesday's ceremony, said it was a move toward greater justice. "As Palestine formally becomes a State Party to the Rome Statute today, the world is also a step closer to ending a long era of impunity and injustice," he said, according to an ICC news release. "Indeed, today brings us closer to our shared goals of justice and peace." Judge Kuniko Ozaki, a vice president of the ICC, said acceding to the treaty was just the first step for the Palestinians. "As the Rome Statute today enters into force for the State of Palestine, Palestine acquires all the rights as well as responsibilities that come with being a State Party to the Statute. These are substantive commitments, which cannot be taken lightly," she said. Rights group Human Rights Watch welcomed the development.

The International Criminal Court was set up in 2002 to prosecute genocide, crimes against humanity and war crimes. CNN's Vasco Cotovio, Kareem Khadder and Faith Karimi contributed to this report.

Gold: Membership gives the ICC jurisdiction over alleged crimes committed in Palestinian territories since last June. Israel and the United States opposed the move, which could open the door to war crimes investigations against Israelis.
BART: Palestinian Authority becomes 123rd member of the International Criminal Court. The move gives the court jurisdiction over alleged crimes in Palestinian territories. Israel and the United States opposed the Palestinians' efforts to join the body.

Table 14: The first instance from the WebNLG dataset, which has two gold target sentences.

The first instance from the PersonaChat dataset.

Input: Given the dialog: i love to meet new people . [SEP] i have a turtle named timothy . [SEP] my favorite sport is ultimate frisbee . [SEP] my parents are living in bora bora . [SEP] autumn is my favorite season . [X SEP] hello , how are you doing tonight ?
Gold: i am well an loving this interaction how are you ?
BART: i am doing well , how are you ?
MVP: i am doing well . how are you ?
MVP+S: i am doing well . how are you ?

The second instance from the PersonaChat dataset.

Input: Given the dialog: i just bought a brand new house . [SEP] i like to dance at the club . [SEP] i run a dog obedience school . [SEP] i have a big sweet tooth . [SEP] i like taking and posting selkies . [X SEP] hello , how are you doing tonight ? [SEP] i am well an loving this interaction how are you ?
Gold: i am great . i just got back from the club .
BART: i am doing well , thanks for asking . what do you do for fun ?
MVP: i am doing well . i just got home from dancing at the club .
MVP+S: i am doing well . i just bought a new house .

A LIMITATIONS

Despite our efforts to collect as many generation tasks and datasets as possible, we only evaluate the generation quality and generalization ability of our models on a limited number of tasks and datasets. The interpretability and robustness of our models require further analysis. Besides, there exists subjectivity when collecting intra-task datasets, despite our attempts to employ widely recognized categorizations from the literature. Due to limitations in computing power, we do not study the performance of our method at different model scales. The effectiveness of multi-task pre-training from scratch, similar to ExT5 (Aribandi et al., 2022), also merits an in-depth study. Regarding evaluation methods, we only consider basic automatic metrics such as BLEU (Papineni et al., 2002) and ROUGE (Lin, 2004); however, there is still a certain gap between these metrics and human judgments (Sai et al., 2022).

B TASKS AND DATASETS

B.1 DESCRIPTION OF TASKS AND DATASETS

We provide the details of the tasks and datasets used in our paper for pre-training and fine-tuning in Tables 7 and 8. If a dataset for pre-training does not have a validation set, we hold out 10% of its training set for validation. We list the licenses for all datasets that have one. All datasets are publicly available. The majority of them can be directly downloaded from GitHub or Google Drive. ROCStories (Mostafazadeh et al., 2016) and CommonGen (Lin et al., 2020a) can be obtained after filling out a form. GYAFC (Rao & Tetreault, 2018) is accessible after requesting Yahoo and the authors of the dataset. The tasks and datasets we use in this paper are as follows:

• Data-to-text generation aims to generate descriptive text about structured data, such as knowledge graphs and tables. We use the following datasets for pre-training:

C FINE-TUNING AND EVALUATION DETAILS

In this section, we introduce the details for fine-tuning and evaluating each downstream task. For the experiments in Section 4 (Tables 2 and 3) and Appendix D.1 (Table 9), the fine-tuning details are introduced in Section 4, and the evaluation details are as follows:

• For data-to-text generation tasks, we use BLEU(-4), ROUGE-L, and METEOR for evaluation. We use the script provided by Chen et al. (2020b);
• For open-ended dialogue system tasks, we use BLEU-1, BLEU-2, Distinct-1, and Distinct-2 for evaluation. For DSTC7-AVSD, we also utilize CIDEr (Vedantam et al., 2015). We employ NLTK 3.5 with smoothing function 7 to compute BLEU for PersonaChat and DailyDialog, and utilize the corresponding script to evaluate DSTC7-AVSD;
• For question answering tasks, we use Exact Match (EM) and Macro-averaged F1 score (F1) for evaluation. We use the provided scripts for CoQA and SQuAD;
• For question generation tasks, we use BLEU-4, ROUGE-L, and METEOR for evaluation. We use the script provided by Dong et al. (2019);
• For story generation, we employ nucleus sampling with p = 0.9 and a temperature of 0.7 following Guan et al. (2021). We use corpus BLEU-1, BLEU-2, Distinct-1, and Distinct-4 for evaluation. We use NLTK 3.5 to calculate corpus BLEU following Guan et al. (2021);
• For task-oriented dialogue system tasks, we use BLEU(-4), inform (rate), success (rate), and combined score for evaluation. Inform and success are two specially designed accuracy metrics for task-oriented dialogue systems, and the combined score is defined as (Inform + Success) × 0.5 + BLEU (Budzianowski et al., 2018). We use the script provided by Su et al. (2022);
• For text summarization tasks, we use ROUGE-1, ROUGE-2, and ROUGE-L for evaluation. We use the toolkit files2rouge.

For the experiments in Section 5 (Tables 4 and 5), the fine-tuning and evaluation details are as follows:

• For paraphrase generation tasks, we employ the fine-tuning and evaluation scripts provided by AESOP (Sun et al., 2021). The evaluation metrics are BLEU-4, ROUGE-1, ROUGE-2, ROUGE-L, and METEOR;
• For text style transfer tasks, we employ the fine-tuning and evaluation scripts provided by SC & BLEU (Lai et al., 2021). We conduct the informal-to-formal transfer and train the model on the data from both the E&M and F&R domains following Lai et al. (2021). The evaluation metrics are BLEU-4, accuracy, and HM. Accuracy is calculated by a pre-trained TextCNN to evaluate the style strength, and HM denotes the harmonic mean of BLEU-4 and style accuracy (Lai et al., 2021);
• For GLUE tasks, we utilize the fine-tuning code provided by Hugging Face. The hyperparameters are consistent with the original BART (Lewis et al., 2020), and the evaluation is computed by the official website.
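The combined score for task-oriented dialogue is a simple linear combination of the three component metrics. A minimal sketch of the formula above (the function name and example values are ours, not from the official MultiWOZ evaluation script):

```python
def combined_score(inform, success, bleu):
    """Combined score for task-oriented dialogue (Budzianowski et al., 2018):
    (Inform + Success) * 0.5 + BLEU, with all three inputs on a 0-100 scale."""
    return (inform + success) * 0.5 + bleu

# Hypothetical system with 80% inform rate, 70% success rate, and BLEU 18.5:
# (80 + 70) * 0.5 + 18.5 = 93.5
score = combined_score(80.0, 70.0, 18.5)
```

Note that inform and success dominate the score, so a system can trade a small BLEU loss for a large gain in task completion.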

E QUALITATIVE EXAMPLES

In this section, we showcase the linearized inputs, human-written task prompts, and corresponding outputs of a single dataset for each task in Section 4. We provide the results of BART, MVP, and MVP+S under full tuning settings. To minimize human intervention, we select the first and second instances of the test set.

Under review as a conference paper at ICLR 2023

Table 12: The first instance from the CNN/Daily Mail dataset. Human-written task prompts are labeled in italics. The setting is the same below.
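The linearized inputs in the example tables follow a common pattern: a human-written task prompt, optional context pieces joined with " [SEP] ", and the main source text attached with " [X SEP] ". A minimal sketch of how such strings can be assembled (the helper name is ours, and the released preprocessing code may differ in details, e.g. multi-turn QA inputs chain several " [X SEP] " segments):

```python
def linearize(task_prompt, source, context=None):
    """Assemble a text-to-text input in the style of the examples below:
    "<prompt>: <source>" for single-input tasks, or
    "<prompt>: <ctx_1> [SEP] ... [SEP] <ctx_n> [X SEP] <source>"
    when context pieces (e.g., persona sentences or a question)
    precede the main source text."""
    if context:
        return f"{task_prompt}: {' [SEP] '.join(context)} [X SEP] {source}"
    return f"{task_prompt}: {source}"

# Summarization: prompt plus document only.
summ = linearize("Summarize", "Marseille, France (CNN)The French prosecutor ...")

# Question answering: the question precedes the passage.
qa = linearize("Answer the following question",
               "once upon a time , in a barn near a farm house , ...",
               context=["what color was cotton ?"])
```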

Input

Summarize: Marseille, France (CNN)The French prosecutor leading an investigation into the crash of Germanwings Flight 9525 insisted Wednesday that he was not aware of any video footage from on board the plane. Marseille prosecutor Brice Robin told CNN that "so far no videos were used in the crash investigation." He added, "A person who has such a video needs to immediately give it to the investigators." Robin's comments follow claims by two magazines, German daily Bild and French Paris Match, of a cell phone video showing the harrowing final seconds from on board Germanwings Flight 9525 as it crashed into the French Alps. All 150 on board were killed. Paris Match and Bild reported that the video was recovered from a phone at the wreckage site. The two publications described the supposed video, but did not post it on their websites. The publications said that they watched the video, which was found by a source close to the investigation. "One can hear cries of 'My God' in several languages," Paris Match reported. "Metallic banging can also be heard more than three times, perhaps of the pilot trying to open the cockpit door with a heavy object. Towards the end, after a heavy shake, stronger than the others, the screaming intensifies. Then nothing." "It is a very disturbing scene," said Julian Reichelt, editor-in-chief of Bild online. An official with France's accident investigation agency, the BEA, said the agency is not aware of any such video. Lt. Col. Jean-Marc Menichini, a French Gendarmerie spokesman in charge of communications on rescue efforts around the Germanwings crash site, told CNN that the reports were "completely wrong" and "unwarranted." Cell phones have been collected at the site, he said, but that they "hadn't been exploited yet." Menichini said he believed the cell phones would need to be sent to the Criminal Research Institute in Rosny sous-Bois, near Paris, in order to be analyzed by specialized technicians working hand-in-hand with investigators.
But none of the cell phones found so far have been sent to the institute, Menichini said. Asked whether staff involved in the search could have leaked a memory card to the media, Menichini answered with a categorical "no." Reichelt told "Erin Burnett: Outfront" that he had watched the video and French President Francois Hollande, speaking Tuesday, said that it should be possible to identify all the victims using DNA analysis by the end of the week, sooner than authorities had previously suggested. In the meantime, the recovery of the victims' personal belongings will start Wednesday, Menichini said. Among those personal belongings could be more cell phones belonging to the 144 passengers and six crew on board. Check out the latest from our correspondents. The details about Lubitz's correspondence with the flight school during his training were among several developments as investigators continued to delve into what caused the crash and Lubitz's possible motive for downing the jet. A Lufthansa spokesperson told CNN on Tuesday that Lubitz had a valid medical certificate, had passed all his examinations and "held all the licenses required." Earlier, a spokesman for the prosecutor's office in Dusseldorf, Christoph Kumpa, said medical records reveal Lubitz suffered from suicidal tendencies at some point before his aviation career and underwent psychotherapy before he got his pilot's license. Kumpa emphasized there's no evidence suggesting Lubitz was suicidal or acting aggressively before the crash. Investigators are looking into whether Lubitz feared his medical condition would cause him to lose his pilot's license, a European government official briefed on the investigation told CNN on Tuesday. While flying was "a big part of his life," the source said, it's only one theory being considered. 
Another source, a law enforcement official briefed on the investigation, also told CNN that authorities believe the primary motive for Lubitz to bring down the plane was that he feared he would not be allowed to fly because of his medical problems. Lubitz's girlfriend told investigators he had seen an eye doctor and a neuropsychologist, both of whom deemed him unfit to work recently and concluded he had psychological issues, the European government official said. But no matter what details emerge about his previous mental health struggles, there's more to the story, said Brian Russell, a forensic psychologist. "Psychology can explain why somebody would turn rage inward on themselves about the fact that maybe they weren't going to keep doing their job and they're upset about that and so they're suicidal," he said. "But there is no mental illness that explains why somebody then feels entitled to also take that rage and turn it outward on 149 other people who had nothing to do with the person's problems." Germanwings crash compensation: What we know. Who was the captain of Germanwings Flight 9525? CNN's Margot Haddad reported from Marseille and Pamela Brown from Dusseldorf, while Laura Smith-Spark wrote from London. CNN's Frederik Pleitgen, Pamela Boykoff, Antonia Mortensen, Sandrine Amiel and Anna-Maja Rappard contributed to this report.

Gold

Match are "very confident" the video clip is real, an editor says. Andreas Lubitz had informed his Lufthansa training school of an episode of severe depression, airline says.

BART

Prosecutor: "A person who has such a video needs to immediately give it to the investigators" French Gendarmerie spokesman: Cell phones have been collected at the site, but haven't been exploited.

MVP

Prosecutor: "So far no videos were used in the crash investigation" An official with France's accident investigation agency says the agency is not aware of any video.

MVP+S

Prosecutor: "So far no videos were used in the crash investigation" Two publications claim to have found a cell phone video of the crash. "It is a very disturbing scene," says editor-in-chief of Bild online.

Table 16: The first instance from the SQuAD dataset.

Input: Generate the question based on the answer: Saint Bernadette Soubirous [SEP] Architecturally , the school has a Catholic character . Atop the Main Building ' s gold dome is a golden statue of the Virgin Mary . Immediately in front of the Main Building and facing it , is a copper statue of Christ with arms upraised with the legend " Venite Ad Me Omnes " . Next to the Main Building is the Basilica of the Sacred Heart . Immediately behind the basilica is the Grotto , a Marian place of prayer and reflection . It is a replica of the grotto at Lourdes , France where the Virgin Mary reputedly appeared to Saint Bernadette Soubirous in 1858 . At the end of the main drive ( and in a direct line that connects through 3 statues and the Gold Dome ) , is a simple , modern stone statue of Mary .

Gold

To whom did the Virgin Mary allegedly appear in 1858 in Lourdes France ?

BART

Who is believed to have appeared to the Virgin Mary at Lourdes ?

MVP

Who did the Virgin Mary appear to in Lourdes ?

MVP+S

The Grotto is a replica of the grotto at Lourdes , France where the Virgin Mary appeared to whom ?

Table 17: The second instance from the SQuAD dataset.

Input: Generate the question based on the answer: a copper statue of Christ [SEP] Architecturally , the school has a Catholic character . Atop the Main Building ' s gold dome is a golden statue of the Virgin Mary . Immediately in front of the Main Building and facing it , is a copper statue of Christ with arms upraised with the legend " Venite Ad Me Omnes " . Next to the Main Building is the Basilica of the Sacred Heart . Immediately behind the basilica is the Grotto , a Marian place of prayer and reflection . It is a replica of the grotto at Lourdes , France where the Virgin Mary reputedly appeared to Saint Bernadette Soubirous in 1858 . At the end of the main drive ( and in a direct line that connects through 3 statues and the Gold Dome ) , is a simple , modern stone statue of Mary .

Table 20: The first instance from the ROCStories dataset.

Input: Given the story title: male was out jogging one morning .

Gold

the weather was crisp and cool . male felt good and energetic . he decided to keep jogging longer than normal . male went several more miles out of his way .

BART: he saw a man struggling to walk across a field . male stopped to help him . he saw a guy struggling to walking across a fields . the man thanked male profusely !
MVP: he ran into a man who was holding a knife . male was scared , so he ran away . the man threatened male . male ran back into his house .

MVP+S

he noticed a sound coming from a tree . male went to investigate . he found a small baby bird stuck in the trunk . male took the bird to the vet and they released it .

Table 21: The second instance from the ROCStories dataset.

Input: Given the story title: ken was driving around in the snow .
Gold: he needed to get home from work . he was driving slowly to avoid accidents . unfortunately the roads were too slick and ken lost control . his tires lost traction and he hit a tree .
BART: he hit a patch of black ice . ken slid and hurt his knee . he went to the doctor for it . krampus was sent home with a concussion .
MVP: he was in his car and it was very cold . ken started to back up . the car started to make a loud noise . kena had to back out and call for help .
MVP+S: he was having trouble starting his car . he had to push it a couple of blocks . his car would n't start . ken had to call a tow truck for help .

