AMBERT: A PRE-TRAINED LANGUAGE MODEL WITH MULTI-GRAINED TOKENIZATION

Abstract

Pre-trained language models such as BERT have exhibited remarkable performance on many tasks in natural language understanding (NLU). The tokens in these models are usually fine-grained, in the sense that for languages like English they are words or sub-words, and for languages like Chinese they are characters. In English, for example, there are multi-word expressions that form natural lexical units, and thus the use of coarse-grained tokenization also appears reasonable. In fact, both fine-grained and coarse-grained tokenizations have advantages and disadvantages for the learning of pre-trained language models. In this paper, we propose a novel pre-trained language model, referred to as AMBERT (A Multi-grained BERT), built on both fine-grained and coarse-grained tokenizations. For English, AMBERT takes both the sequence of words (fine-grained tokens) and the sequence of phrases (coarse-grained tokens) as input after tokenization, employs one encoder for processing the sequence of words and another encoder for processing the sequence of phrases, utilizes shared parameters between the two encoders, and finally creates a sequence of contextualized representations of the words and a sequence of contextualized representations of the phrases. Experiments have been conducted on benchmark datasets for Chinese and English, including CLUE, GLUE, SQuAD and RACE. The results show that AMBERT outperforms the existing best-performing models in almost all cases, with particularly significant improvements for Chinese. We also develop a version of AMBERT that performs as well as AMBERT but requires about half of its inference time.

1. INTRODUCTION

Pre-trained models such as BERT, RoBERTa, and ALBERT (Devlin et al., 2018; Liu et al., 2019; Lan et al., 2019) have shown great power in natural language understanding (NLU). These Transformer-based language models are first learned from a large corpus in pre-training, and then learned from labeled data of a downstream task in fine-tuning. With the Transformer (Vaswani et al., 2017), the pre-training technique, and big data, the models can effectively capture the lexical, syntactic, and semantic relations between the tokens in the input text and achieve state-of-the-art performance on many NLU tasks, such as sentiment analysis, text entailment, and machine reading comprehension. In BERT, for example, pre-training is mainly conducted through masked language modeling (MLM), in which about 15% of the tokens in the input text are masked with a special token [MASK], and the goal is to reconstruct the original text from the masked text. Fine-tuning is performed separately for individual tasks such as text classification, text matching, text span detection, etc. Usually, the tokens in the input text are fine-grained; for example, they are words or sub-words in English and characters in Chinese. In principle, the tokens can also be coarse-grained, that is, for example, phrases in English and words in Chinese. There are many multi-word expressions in English such as 'New York' and 'ice cream', and the use of phrases also appears reasonable. It is more sensible to use words (including single-character words) in Chinese, because they are basic lexical units. In fact, all existing pre-trained language models employ single-grained (usually fine-grained) tokenization. Previous work indicates that the fine-grained approach and the coarse-grained approach both have pros and cons.
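The MLM objective described above can be sketched in a few lines. This is a minimal, hypothetical illustration of BERT-style masking, not the authors' implementation: roughly 15% of positions become prediction targets, and of those 80% are replaced with [MASK], 10% with a random token, and 10% left unchanged; the toy vocabulary and function names are assumptions for the example.

```python
import random

MASK = "[MASK]"
TOY_VOCAB = ["new", "york", "is", "a", "big", "city"]  # toy vocabulary (assumption)

def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """BERT-style masking: pick ~mask_prob of positions as prediction targets.

    Of the chosen positions, 80% become [MASK], 10% a random vocabulary
    token, and 10% stay unchanged. Returns the corrupted sequence and a
    map from position to the original token the model must reconstruct.
    """
    rng = random.Random(seed)
    masked, targets = list(tokens), {}
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            targets[i] = tok  # original token to be reconstructed
            r = rng.random()
            if r < 0.8:
                masked[i] = MASK
            elif r < 0.9:
                masked[i] = rng.choice(TOY_VOCAB)
            # else: keep the original token unchanged
    return masked, targets
```

During pre-training, the loss is computed only at the target positions, so the model learns to predict the original tokens from their bidirectional context.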
The tokens in the fine-grained approach are less complete as lexical units, but their representations are easier to learn (because there are fewer token types and more tokens in the training data), while the tokens in the coarse-grained approach are more complete as lexical units, but their representations are harder to learn (because there are more token types and fewer tokens in the training data).
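The contrast between the two granularities can be made concrete with a toy example. The sketch below is a simplified illustration, not the paper's tokenizer: the real model uses sub-word tokenization for the fine-grained sequence and a learned phrase lexicon for the coarse-grained one, whereas here whitespace splitting and a hand-written two-word phrase set stand in for both.

```python
# Toy phrase lexicon standing in for a learned one (assumption for illustration)
PHRASES = {("new", "york"), ("ice", "cream")}

def fine_tokens(text):
    """Fine-grained tokenization: plain lowercase whitespace split
    (the real model would use sub-words such as WordPiece)."""
    return text.lower().split()

def coarse_tokens(words):
    """Coarse-grained tokenization: greedily merge adjacent words
    that form a known phrase into a single token."""
    out, i = [], 0
    while i < len(words):
        if i + 1 < len(words) and (words[i], words[i + 1]) in PHRASES:
            out.append(words[i] + "_" + words[i + 1])
            i += 2
        else:
            out.append(words[i])
            i += 1
    return out
```

For the input "New York is famous for ice cream", the fine-grained sequence has seven tokens while the coarse-grained one has five, with "new_york" and "ice_cream" as single lexical units. AMBERT feeds both sequences to its two parameter-sharing encoders.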

