LATER SPAN ADAPTATION FOR LANGUAGE UNDERSTANDING

Abstract

Pre-trained contextualized language models (PrLMs) broadly use fine-grained tokens (words or sub-words) as the minimal linguistic units in the pre-training phase. Introducing span-level information in pre-training has been shown to further enhance PrLMs. However, such methods require enormous resources and lack adaptivity because of the huge computational cost of pre-training. Instead of fixing the linguistic unit of the input at so early a stage, as nearly all previous work does, we propose a novel method that incorporates span-level information into the representations generated by PrLMs during the fine-tuning phase, for better flexibility. In this way, the modeling of span-level text can be more adaptive to different downstream tasks. In detail, we divide each sentence into span components according to a segmentation generated from a pre-sampled dictionary. Based on the sub-token-level representations provided by PrLMs, we bridge the connection between the tokens in each span and yield an accumulated representation with enhanced span-level information. Experiments on the GLUE benchmark show that our approach remarkably improves the performance of PrLMs on various natural language understanding tasks.

1. INTRODUCTION

Pre-trained contextualized language models (PrLMs) such as BERT (Devlin et al., 2018), XLNet (Yang et al., 2019), and ELECTRA (Clark et al., 2020) have led to strong performance gains in downstream natural language understanding (NLU) tasks. Such models' impressive power to generate effective contextualized representations is established by well-designed self-supervised training on a large text corpus. Taking BERT as an example, the model used Masked Language Modeling (MLM) and Next Sentence Prediction (NSP) as pre-training objectives and was trained on a corpus of 3.3 billion words. PrLMs commonly generate fine-grained representations, i.e., subword-level embeddings, to adapt to a broad range of applications. Different downstream tasks sometimes require representations of different granularity. For example, sentence-level tasks such as natural language inference (Bowman et al., 2015; Nangia et al., 2017) demand an overall sentence-level analysis to predict the relationship between sentences. There are also token-level tasks, including question answering and named entity recognition, which require models to generate fine-grained output at the token level (Rajpurkar et al., 2016b; Sang & De Meulder, 2003). Therefore, the representations provided by PrLMs are fine-grained (word or sub-word), and can be easily recombined into representations of any granularity and applied to various downstream tasks without substantial task-specific modifications. Besides fine-grained tokens and sentences, coarse-grained span-level language units such as phrases and named entities are also essential for NLU tasks. Previous works indicate that the capability to capture span-level information can be enhanced by altering pre-training objectives. SpanBERT (Joshi et al., 2019) extends BERT by masking and predicting text spans rather than single tokens during pre-training. ERNIE models (Sun et al., 2019; Zhang et al., 2019a) employ entity-level masking as a pre-training strategy.
StructBERT (Wang et al., 2019) encourages PrLMs to incorporate span-level structural information by adding trigram de-shuffling as a new pre-training objective. The methods mentioned above show that incorporating span-level information in the pre-training phase is effective for various downstream NLU tasks. However, since different downstream tasks have different requirements for span-level information, the strategy of incorporating span-level information in pre-training might not suit all downstream tasks. For example, by leveraging an entity-level masking strategy in pre-training, ERNIE models (Sun et al., 2019; Zhang et al., 2019a) achieve remarkable gains in entity typing and relation classification, but on language inference tasks like MNLI (Nangia et al., 2017), their performance is even worse than BERT's. Therefore, incorporating span-level information more flexibly and more universally is imperative. The representations generated by PrLMs are supposed to be widely applicable in general cases; meanwhile, they are also expected to adapt flexibly to various specific downstream tasks. Thus, the timing at which span-level clues are introduced matters greatly. In this paper, we propose a novel method, Later Span Adaptation (LaSA), that enhances the use of span-level information in a task-specific fine-tuning manner, which is lighter and more adaptive than existing methods. In this work, based on the fine-grained representations generated by BERT, a computationally motivated segmentation is applied to further enhance the utilization of span-level information. Previous work has used semantic role labeling (SRL) (Zhang et al., 2019b) or dependency parsing (Zhou et al., 2019) as auxiliary segmentation tools. Nevertheless, these methods require an extra parsing step, which reduces ease of use. In our method, the segmentation is obtained from a pre-sampled n-gram dictionary.
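The dictionary-based segmentation described above can be sketched as a greedy longest-match over a pre-sampled n-gram dictionary. The function name, dictionary contents, and matching policy below are illustrative assumptions for exposition, not the paper's actual implementation.

```python
# Minimal sketch of dictionary-based span segmentation (hypothetical names):
# merge consecutive tokens into a span whenever their joined form appears
# in a pre-sampled n-gram dictionary, preferring the longest match.

def segment(tokens, ngram_dict, max_n=4):
    """Return a list of [start, end) token-index pairs covering the sentence."""
    spans, i = [], 0
    while i < len(tokens):
        match = 1  # fall back to a single-token span
        for n in range(min(max_n, len(tokens) - i), 1, -1):
            if " ".join(tokens[i:i + n]) in ngram_dict:
                match = n  # longest dictionary n-gram starting at i
                break
        spans.append((i, i + match))
        i += match
    return spans

ngram_dict = {"new york", "machine learning"}
print(segment(["i", "love", "new", "york"], ngram_dict))
# [(0, 1), (1, 2), (2, 4)]
```

Under this policy every token belongs to exactly one span, so the resulting segmentation partitions the sentence, which is what the span-level aggregation step requires.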
The fine-grained representations within the same span of the segmentation are aggregated into a span-level representation. On this basis, the span-level representations are further integrated to generate a sentence-level representation, making the most of both fine-grained and span-level information. We conduct experiments and analysis on the GLUE benchmark (Wang et al., 2018), which contains various NLU tasks, including natural language inference, semantic similarity, and text classification. Empirical results show that our method can enhance the performance of PrLMs to the same degree as altering the pre-training objectives, but more simply and adaptively. Ablation studies and analysis verify that the introduced method is essential to the performance improvement.
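The two-stage aggregation above (sub-tokens to spans, spans to sentence) can be sketched as follows. Mean pooling is an illustrative choice here; the paper's accumulation function may differ, and the function name is an assumption.

```python
import numpy as np

# Hedged sketch: pool sub-token representations into span-level vectors,
# then pool the span vectors into one sentence-level vector.

def span_pool(token_reprs, spans):
    """token_reprs: (seq_len, hidden) array from a PrLM such as BERT;
    spans: list of [start, end) index pairs partitioning the sequence."""
    span_reprs = np.stack([token_reprs[s:e].mean(axis=0) for s, e in spans])
    sentence_repr = span_reprs.mean(axis=0)  # span level -> sentence level
    return span_reprs, sentence_repr

h = np.random.rand(4, 8)                     # 4 sub-tokens, hidden size 8
span_reprs, sent = span_pool(h, [(0, 1), (1, 2), (2, 4)])
print(span_reprs.shape, sent.shape)          # (3, 8) (8,)
```

Because the spans partition the sequence, no sub-token representation is dropped or double-counted, so both granularities remain available to a downstream classifier.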

2. RELATED WORK

2.1. PRE-TRAINED LANGUAGE MODELS

Learning reliable and broadly applicable word representations has long been a prosperous topic in the NLP community. Language modeling objectives have been shown effective for generating satisfying distributed representations (Mnih & Hinton, 2009). By leveraging neural networks and large text corpora, Mikolov et al. (2013) and Pennington et al. (2014) succeeded in training widely applicable word embeddings in an unsupervised manner. ELMo (Peters et al., 2018) further advanced the state of the art for various downstream NLU tasks by generating deep contextualized word representations. Equipped with the Transformer (Vaswani et al., 2017), GPT (Radford et al., 2018) and BERT (Devlin et al., 2018) further explore transfer learning, where models are first pre-trained on a large corpus and then applied to downstream tasks in a fine-tuning manner. Recent PrLMs extend BERT in multiple ways, including using a permutation language model (Yang et al., 2019), training on a larger corpus with a more robust optimization recipe (Liu et al., 2019b), leveraging a parameter-sharing strategy (Lan et al., 2019), and employing a GAN-style architecture (Clark et al., 2020). T5 (Raffel et al., 2019) further explores the limits of transfer learning by conducting exhaustive experiments.

2.2. COARSE-GRAINED PRE-TRAINING METHODS

Previous works indicate that incorporating coarse-grained information in pre-training can enhance the performance of PrLMs. Initially, BERT used the prediction of single masked tokens as one of its pre-training objectives. Since BERT uses WordPiece embeddings (Wu et al., 2016), sentences are tokenized at the sub-word level, so a masked token can be a sub-word token such as "##ing". Devlin et al. (2018) then point out that instead of masking a single token, a "whole word masking" strategy can further improve BERT's performance. After that, Sun et al. (2019) and Zhang et al. (2019a) verify that PrLMs can benefit from an entity-level masking strategy in pre-training. In SpanBERT (Joshi et al., 2019), the model can better represent and predict spans of text by masking random contiguous spans in pre-training. Recently, by making use of both fine-grained and coarse-grained tokenization, AMBERT (Zhang & Li, 2020) outperforms its precursor on various NLU tasks.
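The contrast between single-token and whole-word masking can be illustrated with a short sketch. The grouping rule (continuation pieces begin with "##") follows the WordPiece convention mentioned above; the function name and the simplified per-word sampling are assumptions for illustration, not BERT's exact masking procedure.

```python
import random

# Illustrative whole-word masking over WordPiece input: group sub-word
# pieces into whole words, then mask all pieces of a sampled word together,
# so a fragment like "##ing" is never masked in isolation.

def whole_word_mask(pieces, mask_prob=0.15):
    # group piece indices into whole words ("##" marks a continuation piece)
    words, cur = [], []
    for i, p in enumerate(pieces):
        if p.startswith("##") and cur:
            cur.append(i)
        else:
            if cur:
                words.append(cur)
            cur = [i]
    if cur:
        words.append(cur)
    masked = list(pieces)
    for word in words:
        if random.random() < mask_prob:
            for i in word:
                masked[i] = "[MASK]"  # mask every piece of the word
    return masked

print(whole_word_mask(["play", "##ing", "well"], mask_prob=1.0))
# ['[MASK]', '[MASK]', '[MASK]']
```

With token-level masking, "play" and "##ing" would be sampled independently; grouping them first is what forces the model to predict the whole word from context.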

