MVP-BERT: REDESIGNING VOCABULARIES FOR CHINESE BERT AND MULTI-VOCAB PRETRAINING

Abstract

Although the development of pre-trained language models (PLMs) has significantly raised the performance of various Chinese natural language processing (NLP) tasks, the vocabulary of these Chinese PLMs remains the one provided by Google Chinese BERT (Devlin et al., 2018), which is based on Chinese characters. In addition, masked language model pre-training is based on a single vocabulary, which limits downstream task performance. In this work, we first propose a novel method, seg_tok, to form the vocabulary of Chinese BERT with the help of Chinese word segmentation (CWS) and subword tokenization. We then propose three versions of multi-vocabulary pretraining (MVP) to improve the model's expressiveness. Experiments show that: (a) compared with a char-based vocabulary, seg_tok not only improves the performance of Chinese PLMs on sentence-level tasks, but also improves efficiency; (b) MVP improves PLMs' downstream performance; in particular, it improves seg_tok's performance on sequence labeling tasks.

1. INTRODUCTION

Pretrained language models (PLMs), including BERT (Devlin et al., 2018) and its variants (Yang et al., 2019; Liu et al., 2019b), have proven beneficial for many natural language processing (NLP) tasks, such as text classification, question answering (Rajpurkar et al., 2018) and natural language inference (NLI) (Bowman et al., 2015), in English, Chinese and many other languages. Although they bring impressive improvements to Chinese NLP tasks, most Chinese PLMs still use the vocabulary (vocab) provided by Google Chinese BERT (Devlin et al., 2018). Google Chinese BERT is a character (char) based model, since it splits Chinese characters with blank spaces. In the pre-BERT era, part of the literature on Chinese NLP first performed Chinese word segmentation (CWS) to divide text into sequences of words, and used a word-based vocab in NLP models (Xu et al., 2015; Zou et al., 2013). There are many arguments about which vocab a Chinese NLP model should adopt. The advantages of char-based models are clear. First, a char-based vocab is smaller, thus reducing the model size. Second, it does not rely on CWS, thus avoiding word segmentation errors, which directly yields performance gains on span-based tasks such as named entity recognition (NER). Third, char-based models are less vulnerable to data sparsity and to out-of-vocab (OOV) words, and thus less prone to over-fitting (Li et al., 2019). However, word-based models have their own advantages. First, they produce shorter sequences than their char-based counterparts, and are thus faster. Second, words are less ambiguous, which may help models learn the semantic meanings of words. Third, with a word-based model, exposure bias may be reduced in text generation tasks (Zhao et al., 2013). Another branch of the literature tries to strike a balance between the two by combining word-based embeddings with character-based embeddings (Yin et al., 2016; Dong et al., 2016).
In this article, we try to strike a balance between char-based and word-based models, and provide alternative approaches for building a vocab for Chinese PLMs. We consider three approaches: (1) following Devlin et al. (2018), separate the Chinese characters with white spaces and then learn a subword tokenizer (denoted as char); (2) first segment the sentences with a CWS toolkit like jieba 1 and then learn a subword tokenizer (denoted as seg_tok); (3) perform CWS, keep the high-frequency words as tokens, and tokenize low-frequency words with seg_tok (denoted as seg). See Figure 1 for their workflows of processing an input sentence. Note that the first is essentially the same as the vocab of Google Chinese BERT. Inspired by previous work that incorporates or combines multiple vocabularies (vocabs) in a natural way (Yin et al., 2016; Dong et al., 2016), we also investigate a series of strategies, which we call Multi-Vocab Pretraining (MVP) strategies. The first version of MVP incorporates a hierarchical structure to combine the char-based vocab and the word-based vocab. From the viewpoint of the model's forward pass, the embeddings of Chinese characters are aggregated to form the vector representations of multi-gram words or tokens, which are then fed into transformer encoders, and the word-based vocab is used in masked language model (MLM) training. We denote this version of MVP as MVP_hier. Note that in MVP_hier, the char-based vocab is built by splitting the Chinese words in the word-based vocab into Chinese chars, while non-Chinese tokens are kept unchanged. We denote this strategy as MVP_hier(V), where V is a word-based vocab. The second version of MVP (denoted as MVP_pair) employs a pair of vocabs in MLM. Due to limited resources, in this article we only consider the pair of seg_tok and char. MVP_pair is depicted in Figure 2(c).
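As a rough illustration, the three vocab-building pipelines can be sketched as follows. A toy dictionary-based max-match segmenter and a toy subword table stand in for jieba and a trained sentencepiece BPE model; the dictionaries, function names, and fallback-to-chars rule are illustrative assumptions, not the paper's actual tooling:

```python
# Simplified sketch of the three tokenization pipelines: char, seg_tok, seg.
# WORD_DICT mimics a CWS dictionary (jieba in the paper); SUBWORDS mimics a
# learned subword vocab (sentencepiece BPE); HIGH_FREQ lists the words kept
# whole under the `seg` method. All three tables are toy assumptions.

WORD_DICT = {"篮球", "喜欢", "我们"}          # toy CWS dictionary
SUBWORDS = {"篮球", "喜", "欢", "我", "们"}   # toy learned subword vocab
HIGH_FREQ = {"篮球"}                          # words kept whole under `seg`

def cws(sentence):
    """Greedy max-match word segmentation (stand-in for jieba)."""
    out, i = [], 0
    while i < len(sentence):
        for j in range(len(sentence), i, -1):
            if sentence[i:j] in WORD_DICT:
                out.append(sentence[i:j])
                i = j
                break
        else:                      # no dictionary word starts here
            out.append(sentence[i])
            i += 1
    return out

def subword(piece):
    """Toy subword tokenizer: keep known pieces, else fall back to chars."""
    return [piece] if piece in SUBWORDS else list(piece)

def tokenize(sentence, method):
    if method == "char":       # split into chars, then subword-tokenize
        return [t for c in sentence for t in subword(c)]
    words = cws(sentence)      # both remaining methods segment first
    if method == "seg_tok":    # subword-tokenize every segmented word
        return [t for w in words for t in subword(w)]
    if method == "seg":        # keep high-frequency words whole
        return [t for w in words
                for t in ([w] if w in HIGH_FREQ else subword(w))]
    raise ValueError(method)
```

Under this toy setup, "我们喜欢篮球" ("we like basketball") tokenizes to six single characters under char, while seg_tok keeps "篮球" as one token, yielding a shorter sequence, which is the efficiency advantage discussed above.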
In MVP_pair, a sentence (or a concatenation of multiple sentences in pretraining) is processed and tokenized under both seg_tok and char, and the two token sequences are encoded by two parameter-sharing transformer encoders. Whole word masking (Cui et al., 2019b) is applied during pretraining. For example, suppose the word "篮球" (basketball) is masked. The left encoder, which uses seg_tok, has to predict the single masked token "篮球", and the right encoder has to predict "篮" and "球" at two masked positions. The MLM losses from both sides are added with weights. With MVP_pair, parameter sharing enables the single-vocab model to absorb information from the other vocab, thus enhancing its expressiveness. Note that after pre-training, one can keep either one encoder or both encoders for downstream finetuning. We denote this strategy as MVP_pair(V1, V2, i), where V1 and V2 are two different vocabs, i = s means only the encoder with V1 is kept for finetuning (single-vocab model), and i = e means both encoders are kept (ensemble model). The third version of MVP (denoted as MVP_obj) is depicted in Figure 2(b). In MVP_obj, the sentence is encoded only once with a fine-grained vocab, and the MLM task is conducted with that vocab. As in the figure, the word "喜欢" (like) is masked, and under the char vocab the PLM has to predict "喜" and "欢" for the two masked tokens. As an additional training objective, we employ a more coarse-grained vocab like seg_tok and ask the model to use the starting token ("喜")'s representation to predict the original word under seg_tok. We denote this strategy as MVP_obj(V1, V2), where V1 and V2 are a pair of vocabs and V1 is the more fine-grained one. Extensive experiments and ablation studies are conducted. We select BPE implemented by sentencepiece 2 as the subword model, and ALBERT (Lan et al., 2019) (base model) as our PLM.
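To make the whole-word-masking example concrete, the sketch below derives the inputs and MLM targets for the two parameter-sharing encoders in MVP_pair from a word-segmented sentence. The function name, the "[MASK]" convention, and the position-to-target dict format are illustrative assumptions, not the paper's implementation:

```python
# Whole word masking under two vocabs: the seg_tok-side encoder sees one
# masked token per masked word, while the char-side encoder sees one masked
# position per character of that word.

MASK = "[MASK]"

def wwm_targets(words, masked_idx):
    """Given a word-segmented sentence and the index of the word to mask,
    return (input, targets) for the seg_tok encoder and the char encoder.
    Targets map token positions to the tokens to be predicted."""
    seg_in = [MASK if i == masked_idx else w for i, w in enumerate(words)]
    seg_tgt = {masked_idx: words[masked_idx]}   # one whole-word target

    char_in, char_tgt, pos = [], {}, 0
    for i, w in enumerate(words):
        for c in w:
            if i == masked_idx:
                char_in.append(MASK)
                char_tgt[pos] = c               # one target per character
            else:
                char_in.append(c)
            pos += 1
    return (seg_in, seg_tgt), (char_in, char_tgt)

# Masking "篮球" (basketball) in "我 喜欢 篮球":
(seg_in, seg_tgt), (char_in, char_tgt) = wwm_targets(["我", "喜欢", "篮球"], 2)
```

In MVP_pair, the two resulting MLM losses would then be combined as a weighted sum. The same char-side targets also illustrate MVP_obj: there, only the char-vocab encoder is run, and the representation at the first masked character position is additionally asked to predict the whole word "篮球" under the coarser vocab.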
Pre-training is done on the Chinese Wikipedia corpus 3, which is also the corpus on which we build the different vocabs. After



1 https://github.com/fxsjy/jieba
2 https://github.com/google/sentencepiece
3 https://dumps.wikimedia.org/zhwiki/latest/



Figure 1: An illustration of how an input sentence is processed into tokens under the different methods we define.

