BED: BOUNDARY-ENHANCED DECODER FOR CHINESE WORD SEGMENTATION

Abstract

Chinese Word Segmentation (CWS) is a fundamental step in the Chinese NLP pipeline. In recent years, with the development of deep learning and pre-trained language models, numerous CWS models based on pre-trained models have been proposed, and their performance has improved dramatically. However, CWS remains an open problem that deserves further study; for example, CWS models still struggle with out-of-vocabulary (OOV) words. To our knowledge, currently proposed CWS approaches mainly refine the encoder part of the model, e.g., by incorporating more word information into the encoder or by pre-training tailored to the CWS task; there has been no attempt to improve the decoder of the CWS model. This paper proposes an optimized decoder for CWS models called the Boundary-Enhanced Decoder (BED). On four benchmark datasets, it brings improvements of 0.05% in average F1 and 0.69% in OOV average F1. We also release our implementation of BED.

1. INTRODUCTION

Chinese Word Segmentation (CWS) is an essential step in the Chinese NLP pipeline. In languages like Chinese, there are no explicit word boundaries in a sentence, whereas in English sentences are naturally split into separate words by spaces. When performing a Chinese NLP task, we are usually required to first segment the sentence into words and then feed the words into a downstream model. Although character-level models have achieved good results on many NLP tasks in recent years, many studies have shown that incorporating word information improves model performance Tian et al. (2020); Liu et al. (2021); Zhang & Yang (2018). A good word segmentation model is therefore important for Chinese NLP tasks. Although CWS performance has improved dramatically with the recent development of pre-trained models, many problems remain. For example, models usually perform poorly on words that are difficult to segment, and out-of-vocabulary (OOV) words are one such class. OOV words are words whose segmentation the model never observed during training, and many studies show that models segment them less accurately than common words. To alleviate this problem, we propose the Boundary-Enhanced Decoder (BED). Our method is inspired by how humans perform word segmentation. The difficulty of segmentation differs from word to word within a sentence: some words are easy to segment, while others require repeated scrutiny to be segmented correctly. When people perform word segmentation, they therefore tend to first mark the boundaries of easy, relatively certain words, such as punctuation and some transition words, so that the sentence is roughly divided into large blocks. The content within each block is then segmented more finely.
For example, consider the sentence in Figure 1, "王钟翰/论证/后金/在/进入/辽沈/前/，/已/属/奴隶制/社会/。(Wang Zhonghan argues that Post-Jin had already belonged to a slave society before entering Liao and Shen.)". We can split it effortlessly into coarse-grained parts, as in the second layer of Figure 1. Blocks with a red background are words. All words can be determined in this first pass, except for the part "论证后金 (argues Post-Jin)" with a yellow background, which needs finer-grained segmentation. There are two ways to segment it: "{论证, 后, 金} ({argues, after, Jin})" and "{论证, 后金} ({argues, Post-Jin})", the latter being correct. Segmenting such words correctly is demanding. However, if we first segment the relatively easy blocks of the sentence into words, the hard block becomes easier to segment, because interference from the rest of the sentence is excluded. We therefore propose BED, which imitates this process: the model looks for positions that are easy to tokenize, first divides the whole sentence into several parts, and then tokenizes each part into more fine-grained words. This more intuitive approach further improves word segmentation performance, especially on OOV words. To the best of our knowledge, this is the first study that optimizes the decoder of a CWS model. In the remainder of this paper, we first review related work on Chinese word segmentation in Section 2. We then define the CWS problem and introduce our proposed decoder and the entire model architecture. Finally, we report exhaustive experiments, with results and analysis described in Section 5.
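The coarse-to-fine procedure described above can be sketched as follows. This is a conceptual illustration only, not the actual BED architecture: the `boundary_conf` scores, the confidence threshold, and the `fine_segmenter` callback are hypothetical stand-ins for whatever boundary model and fine-grained segmenter are used.

```python
def segment(sentence, boundary_conf, fine_segmenter, threshold=0.9):
    """Coarse-to-fine segmentation sketch (illustrative only).

    boundary_conf[i] is an assumed probability that a word boundary
    follows character i; fine_segmenter is any callable that segments
    one block of text into a list of words.
    """
    # Stage 1: cut at "easy" positions (high-confidence boundaries,
    # e.g. after punctuation), producing coarse blocks.
    blocks, start = [], 0
    for i, conf in enumerate(boundary_conf):
        if conf >= threshold:
            blocks.append(sentence[start:i + 1])
            start = i + 1
    if start < len(sentence):
        blocks.append(sentence[start:])

    # Stage 2: segment each coarse block into fine-grained words,
    # free of interference from the rest of the sentence.
    words = []
    for block in blocks:
        words.extend(fine_segmenter(block))
    return words
```

For instance, with confidence 0.95 after the third and sixth characters of a six-character input, stage 1 yields two blocks, and stage 2 refines each block independently.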

2. RELATED WORK

Figure 1: A human CWS case.

Recently, with the development of pre-training, many different pre-trained models have been proposed Devlin et al. (2018); Liu et al. (2019); Clark et al. (2020); Zhang et al. (2019); Floridi & Chiriatti (2020); Lan et al. (2019); Bai et al. (2021); Dong et al. (2019). Approaches based on pre-trained models have achieved state-of-the-art results on many NLP tasks Rajpurkar et al. (2016); Wang et al. (2018), and the same is true for CWS. Recently proposed methods based on pre-trained models have greatly improved performance on CWS. These methods use a pre-trained masked language model as the text encoder and a CRF or softmax layer as the decoder Liu et al. (2021); Ke et al. (2020); Huang et al. (2019); Tian et al. (2020); Meng et al. (2019); Huang et al. (2021); the model is then fine-tuned on a CWS corpus. Among these studies, some leverage external information, such as vocabularies from an external corpus. Liu et al. (2021) and Tian et al. (2020) both incorporate word information into the encoder. Both use character-level pre-trained models: during fine-tuning, the embeddings of dictionary words contained in the sentence are added at the corresponding character positions. Categorizing such models by where word information is fused, as proposed by Liu et al. (2021), Tian et al. (2020) fuses word representations at the model level while Liu et al. (2021) fuses them at the BERT level. Both fusion models improve CWS performance compared to models using only character-level information. Other works use a single model to uniformly handle CWS datasets with distinct segmentation criteria. Ke et al. (2020) proposed a pre-training model specifically for CWS; it has a dedicated input position representing the segmentation criterion, which allows it to unify different CWS criteria.
To adapt to new segmentation criteria in the fine-tuning stage, they used a meta-learning algorithm during pre-training. Huang et al. (2019) proposed a model that uses BERT as the text encoder and extracts two kinds of information: common information shared by all criteria, and information belonging only to each criterion.
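As background for the encoder-decoder pipelines discussed above: CWS is commonly cast as character-level tagging with a BMES scheme (B = begin, M = middle, E = end of a multi-character word, S = single-character word), and the decoder's predicted tag sequence is converted to words deterministically. A minimal sketch of that conversion (the function name and greedy handling of malformed tag sequences are our choices, not taken from any cited system):

```python
def bmes_to_words(chars, tags):
    """Convert per-character BMES tags into a word list.

    B/M/E mark the begin, middle, and end of a multi-character word;
    S marks a single-character word. Unfinished buffers from malformed
    tag sequences are flushed greedily as words.
    """
    words, buf = [], ""
    for ch, tag in zip(chars, tags):
        if tag == "S":
            if buf:                  # flush any unfinished word
                words.append(buf)
                buf = ""
            words.append(ch)
        elif tag == "B":
            if buf:
                words.append(buf)
            buf = ch
        elif tag == "M":
            buf += ch
        else:                        # "E": close the current word
            words.append(buf + ch)
            buf = ""
    if buf:
        words.append(buf)
    return words
```

For example, the tags S B E B E over the characters 他 来 自 北 京 yield the words {他, 来自, 北京}.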

