DOWNSTREAM DATASETS MAKE SURPRISINGLY GOOD PRETRAINING CORPORA

Abstract

For most natural language processing tasks, the dominant practice is to finetune large pretrained transformer models (e.g., BERT) using smaller downstream datasets. Despite the success of this approach, it remains unclear to what extent these gains are attributable to the massive background corpora employed for pretraining versus to the pretraining objectives themselves. This paper introduces a large-scale study of self-pretraining, where the same (downstream) training data is used for both pretraining and finetuning. In experiments addressing both ELECTRA and RoBERTa models and 10 distinct downstream classification datasets, we observe that self-pretraining rivals standard pretraining on the BookWiki corpus (despite using around 10×-500× less data), outperforming the latter on 7 and 5 datasets, respectively. Surprisingly, these task-specific pretrained models often perform well on other tasks, including the GLUE benchmark. Beyond classification, self-pretraining also provides benefits on structured output prediction tasks such as span-based question answering and commonsense inference, often providing more than 50% of the performance boost conferred by pretraining on the BookWiki corpus. Our results suggest that in many scenarios, performance gains attributable to pretraining are driven primarily by the pretraining objective itself and are not always attributable to the use of massive amounts of external pretraining data. These findings are especially relevant in light of concerns about intellectual property and offensive content in web-scale pretraining data.

1. INTRODUCTION

For training predictive models operating on natural language data, the current best practice is to pretrain models on large unlabeled upstream corpora to optimize self-supervised objectives, for example, masked language modeling (MLM); the resulting weights are then used to initialize models that are subsequently trained (finetuned) on the labeled downstream data available for the task at hand. Large-scale pretrained models typically provide significant performance boosts when compared to models trained directly on the downstream task (with random initializations) (Peters et al., 2018; Devlin et al., 2019; Chiang & Lee, 2020; Krishna et al., 2021). Upstream corpora tend to be significantly larger than the downstream corpora, and the success of this approach is often attributed to its ability to leverage these massive upstream corpora (Liu et al., 2019; Yang et al., 2019). For example, the seminal BERT model (Devlin et al., 2019) was pretrained using the BookWiki corpus, a combination of English Wikipedia and BooksCorpus (Zhu et al., 2015), totaling 13GB of plain text. Subsequent models have moved on to web-scale data. For example, XLNet (Yang et al., 2019), RoBERTa (Liu et al., 2019), and T5 (Raffel et al., 2020) were trained on 158GB, 160GB, and 750GB of data, respectively. As upstream corpus size and downstream performance have gone up, popular attempts at explaining these gains have focused on themes of "knowledge transfer" from the upstream corpus, attributing them to shared linguistic structure, semantics (Lina et al., 2019; Tenney et al., 2019), and facts about the world (Petroni et al., 2019). However, since the introduction of large-scale pretraining corpora occurred together with the invention of self-supervised pretraining objectives (e.g., masked language modeling (Devlin et al., 2019) and replaced token detection (Clark et al., 2019)), it remains unclear to what extent large-scale corpora are integral to these leaps in performance.
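To make the MLM objective concrete, the following is a minimal sketch of BERT-style token corruption (the 15% selection rate and 80/10/10 split follow Devlin et al., 2019); the mask token id and vocabulary size are illustrative placeholders, not tied to any particular tokenizer:

```python
import random

MASK_ID = 103          # hypothetical [MASK] token id, for illustration only
VOCAB_SIZE = 30522     # illustrative vocabulary size

def mask_for_mlm(token_ids, mask_prob=0.15, seed=None):
    """BERT-style MLM corruption: select ~mask_prob of positions; of
    those, replace 80% with [MASK], 10% with a random token, and leave
    10% unchanged. Returns (corrupted_ids, labels), where labels hold
    the original token at selected positions and -100 elsewhere (the
    conventional ignore-index for the loss)."""
    rng = random.Random(seed)
    corrupted = list(token_ids)
    labels = [-100] * len(token_ids)
    for i, tok in enumerate(token_ids):
        if rng.random() < mask_prob:
            labels[i] = tok  # the model must predict the original token here
            r = rng.random()
            if r < 0.8:
                corrupted[i] = MASK_ID
            elif r < 0.9:
                corrupted[i] = rng.randrange(VOCAB_SIZE)
            # else: keep the original token (helps reduce train/test mismatch)
    return corrupted, labels
```

Under self-pretraining, exactly this corruption would be applied to the downstream training texts themselves rather than to an upstream corpus.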
For several tasks, especially summarization, recent works have achieved surprising performance gains in settings where the upstream corpus is created synthetically from arbitrary symbols, but the pretraining objective is designed to capture some of the structure of the task (Krishna et al., 2021; Wu et al., 2022). In this work, we ask how much of pretraining's benefits can be realized in the absence of upstream corpora by pretraining directly on the downstream corpora (with the same self-supervised objectives). We find that this approach, which we call self-pretraining, often rivals the performance boosts conferred by off-the-shelf models pretrained on large upstream corpora (Figure 1), even outperforming them on 7 out of 10 datasets.

Figure 1: Aggregate performance of an ELECTRA model across 10 finetuning datasets when it is (i) randomly initialized, (ii) pretrained on an upstream corpus (BookWiki), or (iii) pretrained on the finetuning dataset itself.

Prior research has shown that additional self-supervised pretraining of off-the-shelf models using the downstream data can give further gains (Gururangan et al., 2020). Yao et al. (2022) showed that one can use the downstream data to efficiently retrieve a tiny subset of a large general corpus for pretraining without sacrificing performance. Our study goes further, showing that even when starting from random initializations, and without using any external data beyond the downstream data itself, self-pretraining can rival standard practice. Since self-pretraining requires only the data that must already be available for downstream finetuning, the benefits of pretraining in this case cannot be attributed to transfer of knowledge from an upstream corpus. Instead, these benefits can only be attributed to the pretraining objective, which may learn some inductive biases better than the finetuning objective (e.g., linguistic knowledge; Tenney et al., 2019), or may simply initialize network parameters such that their statistics lead to better optimization during finetuning (Wu et al., 2022). While similar observations have been made in the computer vision community (El-Nouby et al., 2021), we argue that it is especially important to establish these phenomena in the language domain, where building on self-supervised pretrained models is now the ubiquitous practice of the vast majority of practitioners.

To understand how predictions differ across pretraining strategies (i.e., between self-pretrained and off-the-shelf models), we analyse the errors made by these models on the same downstream data. Despite the models' similar performance, we find that a self-pretrained and an off-the-shelf model make significantly less correlated errors than two independently finetuned models sharing either pretraining strategy. However, we observe that these uncorrelated mistakes do not translate into improvements in ensemble performance.

We find that models pretrained on one downstream dataset often perform surprisingly well when finetuned on other downstream datasets. Notably, the downstream datasets in our study come from a wide variety of domains such as news, online forums, tweets, and reviews (Table 1). Nevertheless, we find that pretraining on any of these downstream datasets delivers significant performance gains on most datasets (greater than half of the off-the-shelf model's gains in 88% of cases), irrespective of domain. However, the best performance on a downstream dataset is usually achieved by the model pretrained on that dataset itself. Models pretrained on downstream datasets also perform well on the GLUE benchmark, despite these datasets having considerably fewer long-range dependencies than standard upstream corpora; for example, the MNLI corpus consists of 2-sentence input texts that are concatenated in random order.
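The error-correlation analysis described above can be illustrated with a simple statistic; the paper's exact measure is not reproduced here, but one natural choice is the Pearson correlation between the two models' per-example error-indicator vectors, sketched below:

```python
def error_correlation(preds_a, preds_b, gold):
    """Pearson correlation between two models' per-example error
    indicators (1 = wrong, 0 = right). Values near 1 mean the models
    err on the same examples; values near 0 mean their mistakes are
    largely independent. Assumes each model is wrong on at least one
    example and right on at least one (otherwise variance is zero)."""
    ea = [int(p != g) for p, g in zip(preds_a, gold)]
    eb = [int(p != g) for p, g in zip(preds_b, gold)]
    n = len(gold)
    mean_a, mean_b = sum(ea) / n, sum(eb) / n
    cov = sum((a - mean_a) * (b - mean_b) for a, b in zip(ea, eb)) / n
    var_a = sum((a - mean_a) ** 2 for a in ea) / n
    var_b = sum((b - mean_b) ** 2 for b in eb) / n
    return cov / (var_a * var_b) ** 0.5
```

For instance, two models wrong on exactly the same examples score 1.0, while models whose errors never overlap score negatively; comparing this statistic across pretraining strategies reveals how differently the resulting models behave despite similar accuracy.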
In addition to classification tasks, we also experiment with tasks such as span-based question answering, named entity recognition, and grounded commonsense inference. Self-pretraining delivers around 40-80% of the performance boost provided by pretraining on the BookWiki corpus, across both ELECTRA and RoBERTa models. Hence, self-pretraining can outperform finetuning randomly initialized models even for tasks that require predicting structured output more complex than a single label, and for tasks whose solution relies on commonsense knowledge.

