LATER SPAN ADAPTATION FOR LANGUAGE UNDERSTANDING

Abstract

Pre-trained contextualized language models (PrLMs) broadly use fine-grained tokens (words or sub-words) as minimal linguistic units in the pre-training phase. Introducing span-level information in pre-training has been shown to further enhance PrLMs. However, such methods demand enormous computational resources for pre-training and lack adaptivity, since the span-level modeling is fixed once pre-training finishes. Instead of fixing the linguistic unit of the input prematurely, as nearly all previous work does, we propose a novel method that injects span-level information into the representations generated by PrLMs during the fine-tuning phase for better flexibility. In this way, the modeling of span-level text can adapt to different downstream tasks. In detail, we divide the sentence into span components according to the segmentation generated by a pre-sampled dictionary. Based on the sub-token-level representations provided by PrLMs, we aggregate the tokens within each span to yield a representation enriched with span-level information. Experiments on the GLUE benchmark show that our approach markedly improves the performance of PrLMs on various natural language understanding tasks.
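The span aggregation described above can be illustrated with a minimal sketch. This excerpt does not specify the aggregation function, so max-pooling over each span's sub-token representations is an assumption here; the function name `span_pool` and the toy shapes are purely illustrative.

```python
import numpy as np

def span_pool(token_reprs, spans):
    """Aggregate sub-token representations into span-level representations.

    token_reprs: (seq_len, hidden) array of PrLM output vectors.
    spans: list of (start, end) index pairs (end exclusive) produced by a
        dictionary-based segmentation, covering the sentence in order.
    Returns a (num_spans, hidden) array, one pooled vector per span.
    Max-pooling is an assumption; the paper may use a different operator.
    """
    return np.stack([token_reprs[s:e].max(axis=0) for s, e in spans])

# Example: 6 sub-tokens segmented into three spans.
reprs = np.random.randn(6, 8)
spans = [(0, 2), (2, 3), (3, 6)]
pooled = span_pool(reprs, spans)
print(pooled.shape)  # (3, 8)
```

A single-token span simply passes its vector through unchanged, so the pooled sequence stays aligned with the dictionary segmentation.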

1. INTRODUCTION

Pre-trained contextualized language models (PrLMs) such as BERT (Devlin et al., 2018), XLNet (Yang et al., 2019), and ELECTRA (Clark et al., 2020) have led to strong performance gains in downstream natural language understanding (NLU) tasks. Such models' impressive power to generate effective contextualized representations is established by well-designed self-supervised training on a large text corpus. Taking BERT as an example, the model used Masked Language Modeling (MLM) and Next Sentence Prediction (NSP) as pre-training objectives and was trained on a corpus of 3.3 billion words.

PrLMs commonly generate fine-grained representations, i.e., subword-level embeddings, to adapt to broad applications. Different downstream tasks may require representations of different granularity. For example, sentence-level tasks such as natural language inference (Bowman et al., 2015; Nangia et al., 2017) demand an overall sentence-level analysis to predict the relationship between sentence pairs. There are also token-level tasks, including question answering and named entity recognition, which require models to generate fine-grained output at the token level (Rajpurkar et al., 2016b; Sang & De Meulder, 2003). Therefore, the representations provided by PrLMs are fine-grained (word or sub-word), so that they can easily be recombined into representations of any granularity and applied to various downstream tasks without substantial task-specific modifications.

Besides fine-grained tokens and sentences, coarse-grained span-level language units such as phrases and named entities are also essential for NLU tasks. Previous work indicates that the capability to capture span-level information can be enhanced by altering pre-training objectives. SpanBERT (Joshi et al., 2019) extends BERT by masking and predicting contiguous text spans rather than individual tokens during pre-training. ERNIE models (Sun et al., 2019; Zhang et al., 2019a) employ entity-level masking as a pre-training strategy.
StructBERT (Wang et al., 2019) encourages PrLMs to incorporate span-level structural information by adding trigram de-shuffling as a new pre-training objective. These methods show that incorporating span-level information in the pre-training phase is effective for various downstream NLU tasks.
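To make the span-masking objective concrete, the following is a minimal sketch of SpanBERT-style span selection: span lengths are drawn from a geometric distribution (clipped at a maximum length) and contiguous spans are masked until a target fraction of the sequence is covered. The parameter values (`mask_ratio=0.15`, `p=0.2`, `max_len=10`) follow the SpanBERT paper; the function name and exact loop structure are illustrative, not the authors' implementation.

```python
import random

def sample_masked_spans(seq_len, mask_ratio=0.15, p=0.2, max_len=10):
    """Select token positions to mask as contiguous spans (sketch).

    Span lengths follow a geometric distribution with success probability p,
    clipped at max_len; spans are sampled until roughly mask_ratio of the
    sequence is covered. Returns the set of masked positions.
    """
    budget = max(1, int(seq_len * mask_ratio))
    masked = set()
    while len(masked) < budget:
        # Draw a span length ~ Geometric(p), clipped at max_len.
        length = 1
        while random.random() > p and length < max_len:
            length += 1
        start = random.randrange(seq_len)
        masked.update(range(start, min(start + length, seq_len)))
    return masked

positions = sample_masked_spans(100)
```

Masking whole spans forces the model to predict a token from the span boundary context rather than from its immediate neighbors, which is the intuition behind these objectives.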

