IMPROVING LANGUAGE MODEL PRETRAINING WITH TEXT STRUCTURE INFORMATION

Abstract

Inter-sentence pretraining tasks learn from sentence relationships and facilitate high-level language understanding that cannot be learned directly from word-level pretraining tasks. However, we have found experimentally that existing inter-sentence methods for general-purpose language pretraining improve performance only at relatively small scales, not at larger ones. As an alternative, we propose Text Structure Prediction (TSP), a more sophisticated inter-sentence task that uses text structure to provide more abundant self-supervised learning signals to pretraining models at larger scales. TSP classifies sentence pairs over six designed text structure relationships; it can be seen as an implicit form of learning high-level language understanding by identifying key concepts and relationships in texts. Experiments show that TSP improves performance on language understanding tasks for models at various scales. Our approach thus serves as an initial attempt to demonstrate that exploiting text structure can facilitate language understanding.

1. INTRODUCTION

General-purpose pretrained language models have been widely applied in natural language processing (NLP). The most representative of these models is BERT (Devlin et al., 2019), which is pretrained simultaneously on two tasks: a masked language model (MLM) task and a next sentence prediction (NSP) task. While MLM masks words and requires models to fill the resulting clozes, NSP is an inter-sentence task of predicting whether two texts are continuous. Inter-sentence tasks learn relationships between sentences and facilitate high-level language understanding that is not directly learned by word-level pretraining tasks (Devlin et al., 2019). However, the representative inter-sentence task, NSP, has been found to fail to improve performance (Liu et al., 2019; Yang et al., 2019). Although a few successors of NSP have been proposed, they are still not widely adopted or researched. In this paper, we show that the existing inter-sentence methods for general-purpose language pretraining are actually suboptimal; to address their weaknesses, we then propose an alternative that redefines what is learned from sentence relations.

The existing general-purpose inter-sentence pretraining tasks include NSP (Devlin et al., 2019), which discriminates whether two texts come from different documents; sentence order prediction (SOP) (Lan et al., 2020), which discriminates whether two texts are swapped; and the sentence structure objective (SSO) (Wang et al., 2020), which discriminates whether two texts come from different documents or are swapped. To investigate claims of improved performance, we experimented with these methods at three different scales (Small, Base, and Large), mainly following the setting of BERT (Devlin et al., 2019). The model size and the amount of consumed data increase from the Small to the Base and then to the Large scale (see Appendix A for details).
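To make the differences between these baselines concrete, the following sketch (our own illustration, not the original authors' code; function and label names are hypothetical) shows how a labeled segment pair could be constructed for each task from a corpus of documents, where each document is a list of text segments:

```python
import random

def make_nsp_example(docs):
    """NSP: positive = two consecutive segments from one document;
    negative = second segment drawn from a different document."""
    doc = random.choice(docs)
    i = random.randrange(len(doc) - 1)
    if random.random() < 0.5:
        return doc[i], doc[i + 1], "is_next"
    other = random.choice([d for d in docs if d is not doc])
    return doc[i], random.choice(other), "not_next"

def make_sop_example(doc):
    """SOP: both segments always come from the same document;
    negative = the two consecutive segments are swapped."""
    i = random.randrange(len(doc) - 1)
    if random.random() < 0.5:
        return doc[i], doc[i + 1], "in_order"
    return doc[i + 1], doc[i], "swapped"

def make_sso_example(docs):
    """SSO: three-way label -- consecutive, swapped, or drawn from
    two different documents."""
    doc = random.choice(docs)
    i = random.randrange(len(doc) - 1)
    r = random.random()
    if r < 1 / 3:
        return doc[i], doc[i + 1], "in_order"
    if r < 2 / 3:
        return doc[i + 1], doc[i], "swapped"
    other = random.choice([d for d in docs if d is not doc])
    return doc[i], random.choice(other), "different_doc"
```

Note that in every case the model sees exactly one pair of segments per example; none of these objectives expose the document's internal paragraph structure to the model.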
As seen in Figure 1, our experimental results show that the existing methods improved performance only at the Small scale, whereas they undermined or failed to improve performance at larger scales. In investigating the reason for the lack of improvement at larger scales, we noticed that all of the aforementioned methods split an input text into two segments and learn only from the relationship between those two segments, thus ignoring the text's underlying structure. As illustrated in Figure 2, text structure can be seen as the organization of information in texts. Without text structure, a text becomes one long continuous word sequence, which is hard to read and makes it difficult for humans to identify key concepts and logical relationships. This intuitive understanding of human reading ability inspires us to explore the possibility that text structure can provide models abundant learning signals of high-level semantics and their interaction, especially for models at larger scales. Hence, our goal in this paper is to demonstrate that learning from text structure can improve general-purpose language model pretraining. We thus propose a new inter-sentence task that better exploits text structure to examine our hypothesis. The proposed task, Text Structure Prediction (TSP), is our initial attempt to integrate text structure information into general-purpose language pretraining. Regarding the task design, we view text structure in terms of two axes: ordering and hierarchy. Hierarchies are nested groupings of sentences, such as paragraphs or sections, where each group conveys a specific concept and its su-

[Figure 2 shows a news article about Google's Qubit Game, annotated with its paragraph structure: Paragraph II (4 sentences), the details of the game; Paragraph III (2 sentences), Google's intention behind the game; Paragraph IV (2 sentences), the meaning of World Quantum Day.]

Figure 2: Example of text structure in real text. The document comprises multiple paragraphs, where each paragraph is composed of multiple sentences. Each paragraph conveys a specific concept constructed from the sentences it comprises, while the order of concepts follows a clear intention and forms a logical flow. This suggests that the combination of hierarchy and order in text structure hides valuable clues to high-level language comprehension.



Figure 1: Experimental results on the language understanding benchmark SuperGLUE (Wang et al., 2019) at the Small, Base, and Large scales. The experiment compared the effect on performance of different inter-sentence tasks (NSP, SOP, SSO, and our proposed task, TSP) when learned concurrently with the word-level MLM pretraining task. By taking advantage of text structure information, TSP outperforms pure MLM at different scales, whereas the other inter-sentence baselines (NSP, SOP, and SSO) failed to improve performance at larger scales in our experiments.
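The two axes of text structure, ordering and hierarchy, naturally yield a multi-way classification over sentence pairs. Since the paper's exact six relationship classes are defined later, the sketch below is only our hypothetical reconstruction of such a labeling scheme, crossing hierarchy (same vs. different paragraph) with ordering; all class names are illustrative, not the paper's:

```python
def tsp_label(pos_a, pos_b):
    """Hypothetical six-way text-structure label for a sentence pair.

    Each position is a (paragraph_index, sentence_index) tuple within
    one document. Hierarchy decides the top-level split (same vs.
    different paragraph); ordering refines it into adjacent/distant
    and before/after classes.
    """
    (pa, sa), (pb, sb) = pos_a, pos_b
    if pa == pb:  # same paragraph (hierarchy axis)
        if sb == sa + 1:
            return "next_in_paragraph"
        if sb == sa - 1:
            return "prev_in_paragraph"
        return "same_paragraph_nonadjacent"
    # different paragraphs (ordering axis over paragraphs)
    if pb == pa + 1:
        return "next_paragraph"
    if pb == pa - 1:
        return "prev_paragraph"
    return "distant_paragraph"
```

Compared with the binary or ternary labels of NSP, SOP, and SSO, a labeling of this shape supplies a denser self-supervised signal, since every sampled pair carries information about both where the sentences sit in the hierarchy and how they are ordered.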

