MVP-BERT: REDESIGNING VOCABULARIES FOR CHINESE BERT AND MULTI-VOCAB PRETRAINING

Abstract

Although the development of pre-trained language models (PLMs) has significantly raised the performance of various Chinese natural language processing (NLP) tasks, the vocabulary of these Chinese PLMs remains the one provided by Google Chinese BERT Devlin et al. (2018), which is based on Chinese characters. In addition, masked language model pre-training is based on a single vocabulary, which limits downstream task performance. In this work, we first propose a novel method, seg tok, to form the vocabulary of Chinese BERT with the help of Chinese word segmentation (CWS) and subword tokenization. We then propose three versions of multi-vocabulary pretraining (MVP) to improve the model's expressiveness. Experiments show that: (a) compared with a char-based vocabulary, seg tok not only improves the performance of Chinese PLMs on sentence-level tasks, but also improves efficiency; (b) MVP improves PLMs' downstream performance; in particular, it improves seg tok's performance on sequence labeling tasks.

1. INTRODUCTION

Pretrained language models (PLMs) such as BERT Devlin et al. (2018) and its variants Yang et al. (2019); Liu et al. (2019b) have proven beneficial for many natural language processing (NLP) tasks, such as text classification, question answering Rajpurkar et al. (2018) and natural language inference (NLI) Bowman et al. (2015), in English, Chinese and many other languages. Although they bring impressive improvements on Chinese NLP tasks, most Chinese PLMs still use the vocabulary (vocab) provided by Google Chinese BERT Devlin et al. (2018). Google Chinese BERT is a character (char) based model, since it splits Chinese characters with blank spaces. In the pre-BERT era, part of the literature on Chinese NLP first performed Chinese word segmentation (CWS) to divide text into sequences of words, and used a word-based vocab in NLP models Xu et al. (2015); Zou et al. (2013). There has been much debate over which vocab a Chinese NLP model should adopt.

The advantages of char-based models are clear. First, a char-based vocab is smaller, which reduces the model size. Second, it does not rely on CWS, thus avoiding word segmentation errors, which directly benefits span-based tasks such as named entity recognition (NER). Third, char-based models are less vulnerable to data sparsity or out-of-vocabulary (OOV) words, and thus less prone to over-fitting (Li et al. (2019)). However, word-based models have their own advantages. First, they produce shorter sequences than their char-based counterparts, and are therefore faster. Second, words are less ambiguous, which may help models learn the semantic meanings of words. Third, with a word-based model, exposure bias may be reduced in text generation tasks (Zhao et al. (2013)). Another branch of the literature tries to strike a balance between the two by combining word-based embeddings with character-based embeddings Yin et al. (2016); Dong et al. (2016).

In this article, we try to strike a balance between char-based and word-based models and provide alternative approaches for building a vocab for Chinese PLMs. We consider three approaches to build a vocab for Chinese PLMs: (1) following Devlin et al. (2018), separate the Chinese characters with white spaces, and then learn a sub-word tokenizer (denoted as char); (2) first segment the sentences with a CWS toolkit like jieba (https://github.com/fxsjy/jieba), and then learn a sub-word tokenizer (denoted as seg tok)
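The two preprocessing pipelines can be sketched as follows. This is a minimal illustration of the pre-tokenization step that precedes subword learning, not the paper's actual preprocessing script; `toy_segmenter` is a hypothetical stand-in for a real CWS toolkit (with jieba installed, one would pass `jieba.lcut` instead).

```python
import re

def char_split(text):
    """Char-based preprocessing (approach 1): insert spaces around each
    CJK character, as in Google Chinese BERT, so a subword learner sees
    every character as a separate token."""
    out = []
    for ch in text:
        if '\u4e00' <= ch <= '\u9fff':  # CJK Unified Ideographs block
            out.append(' ' + ch + ' ')
        else:
            out.append(ch)
    return re.sub(r'\s+', ' ', ''.join(out)).strip()

def seg_tok_split(text, segment):
    """seg tok preprocessing (approach 2): run a CWS toolkit first, then
    hand the word-separated text to a subword-tokenizer learner.
    `segment` is any callable that maps a string to a list of words."""
    return ' '.join(segment(text))

# Hypothetical segmenter standing in for jieba.lcut.
toy_segmenter = lambda t: ["自然", "语言", "处理"] if t == "自然语言处理" else list(t)

print(char_split("自然语言处理"))                     # 自 然 语 言 处 理
print(seg_tok_split("自然语言处理", toy_segmenter))    # 自然 语言 处理
```

Under approach (2), the subword learner is trained on the word-separated corpus, so frequent whole words enter the vocab while rare words fall back to subword pieces, yielding shorter input sequences than the char-based pipeline.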

