IMPROVING LANGUAGE MODEL PRETRAINING WITH TEXT STRUCTURE INFORMATION

Abstract

Inter-sentence pretraining tasks learn from sentence relationships and facilitate high-level language understanding that cannot be directly learned from word-level pretraining tasks. However, we have found experimentally that existing inter-sentence methods for general-purpose language pretraining improve performance only at a relatively small scale, not at larger scales. As an alternative, we propose Text Structure Prediction (TSP), a more sophisticated inter-sentence task that uses text structure to provide richer self-supervised learning signals to pretraining models at larger scales. TSP classifies sentence pairs over six designed text structure relationships, and it can be seen as an implicit form of learning high-level language understanding by identifying key concepts and relationships in texts. Experiments show that TSP improves performance on language understanding tasks for models at various scales. Our approach thus serves as an initial demonstration that exploiting text structure can facilitate language understanding.

1. INTRODUCTION

General-purpose pretrained language models have been widely applied in natural language processing (NLP). The most representative of these models is BERT (Devlin et al., 2019), which is pretrained simultaneously on two tasks: masked language modeling (MLM) and next sentence prediction (NSP). While MLM masks words and requires the model to fill in the resulting clozes, NSP is an inter-sentence task of predicting whether two texts are continuous. Inter-sentence tasks learn relationships between sentences and facilitate high-level language understanding that is not directly learned by word-level pretraining tasks (Devlin et al., 2019). However, the representative inter-sentence task, NSP, has been found to fail to improve performance (Liu et al., 2019; Yang et al., 2019). Although a few successors to NSP have been proposed, they are still not widely adopted or researched. In this paper, we show that the existing inter-sentence methods for general-purpose language pretraining are in fact suboptimal; to address their weaknesses, we then propose an alternative that redefines what is learned from sentence relations.

The existing general-purpose inter-sentence pretraining tasks include NSP (Devlin et al., 2019), which discriminates whether two texts come from different documents; sentence order prediction (SOP) (Lan et al., 2020), which discriminates whether two texts are swapped; and the sentence structure objective (SSO) (Wang et al., 2020), which discriminates whether two texts come from different documents or are swapped. To investigate the claims of improved performance, we experimented with these methods at three different scales (Small, Base, and Large), largely following the setting of BERT (Devlin et al., 2019). Both the model size and the amount of consumed data increase from Small to Base to Large (see Appendix A for details).
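To make the differences between these three objectives concrete, the following sketch shows one plausible way their training examples could be constructed from two tokenized documents. This is an illustrative reconstruction based on the task descriptions above, not code from any of the cited works; the function name and label conventions are our own assumptions.

```python
import random

def make_intersentence_example(doc_a, doc_b, task, rng=random):
    """Build a (segment_pair, label) example for an inter-sentence
    pretraining task. doc_a and doc_b are lists of sentences from two
    different documents. Labels (illustrative convention, not from the
    cited papers):
      NSP: 0 = consecutive segments, 1 = second segment from another doc
      SOP: 0 = correct order, 1 = consecutive segments swapped
      SSO: 0 = correct order, 1 = swapped, 2 = from another doc
    """
    # Pick a pair of consecutive sentences from the first document.
    i = rng.randrange(len(doc_a) - 1)
    first, second = doc_a[i], doc_a[i + 1]

    if task == "NSP":
        if rng.random() < 0.5:
            return (first, second), 0
        return (first, rng.choice(doc_b)), 1  # negative: other document
    if task == "SOP":
        if rng.random() < 0.5:
            return (first, second), 0
        return (second, first), 1             # negative: swapped order
    if task == "SSO":
        r = rng.random()
        if r < 1 / 3:
            return (first, second), 0
        if r < 2 / 3:
            return (second, first), 1         # swapped order
        return (first, rng.choice(doc_b)), 2  # other document
    raise ValueError(f"unknown task: {task}")
```

Note that all three objectives reduce a whole document to a single pair of segments and a binary (or, for SSO, three-way) label, which is precisely the limitation examined next.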
As seen in Figure 1, our experimental results show that the existing methods improved performance only at the Small scale, whereas they undermined or failed to improve performance at larger scales. In investigating why the existing methods yield little improvement at larger scales, we noticed that all of the aforementioned methods split an input text into two segments and learn only from the relationship between those two segments, thus ignoring the text's underlying structure. As illustrated in Figure 2, text structure can be seen as the organization of information in a text. Without text structure, a text becomes one long continuous word sequence, which is hard to read and makes it difficult for humans to identify key concepts and logical relationships. This intuitive understand-

