BED: BOUNDARY-ENHANCED DECODER FOR CHINESE WORD SEGMENTATION

Abstract

Chinese Word Segmentation (CWS) is a fundamental step in the Chinese NLP pipeline. In recent years, with the development of deep learning and pre-trained language models, numerous CWS models built on pre-trained models have been proposed, and their performance has improved dramatically. However, CWS remains an open problem that deserves further study; for example, CWS models still struggle with out-of-vocabulary (OOV) words. To our knowledge, existing CWS approaches mainly refine the encoder of the CWS model, e.g., by incorporating more word information into the encoder or performing CWS-related pre-training, while little attention has been paid to improving the decoder. This paper proposes an optimized decoder for CWS models called the Boundary-Enhanced Decoder (BED). It brings improvements of 0.05% and 0.69% in average F1 and OOV average F1 on four benchmark datasets. We also publish our implementation of BED.

1. INTRODUCTION

Chinese Word Segmentation (CWS) is an essential step in the Chinese NLP pipeline. In languages like Chinese, there are no explicit word boundaries in a sentence; unlike English, where words are naturally delimited by spaces in the text, Chinese text must first be segmented into words before they can be fed into a downstream model. Although character-level models have achieved good results on many NLP tasks in recent years, many studies have shown that incorporating word information improves model performance (Tian et al., 2020; Liu et al., 2021; Zhang & Yang, 2018). A good word segmentation model is therefore significant for Chinese NLP tasks.

Although CWS performance has improved dramatically with the recent development of pre-trained language models, numerous problems remain. For example, models usually perform poorly on hard-to-segment words, and OOV words are one such case. OOV words are words that never appear in the training data, so the model is never taught how to segment them. Many studies show that models segment OOV words considerably less accurately than common words. To alleviate this problem, we propose the Boundary-Enhanced Decoder (BED) in this paper.

Our proposed method is inspired by how humans perform word segmentation. The difficulty of segmentation differs across the words in a sentence: some words are easy to segment, while others require repeated scrutiny to be segmented correctly. Therefore, when people segment a sentence, they tend to first mark the boundaries of easy and relatively certain words, such as punctuation and some transition words, so that the sentence is roughly divided into large blocks. Then, the content within those blocks is segmented more finely.
For example, consider segmenting the sentence in Figure 1, "王钟翰/论证/后金/在/进入/辽沈/前/，/已/属/奴隶制/社会/。(Wang Zhonghan argues that Post-Jin had already belonged to a slave society before entering Liao and Shen.)". We can effortlessly split it into coarse-grained parts, as in the second layer of Figure 1, where blocks with a red background are words. All words can be determined in this first pass, except for the part "论证后金 (argues Post-Jin)" with a yellow background, which requires finer-grained segmentation. There are two ways to segment it: one is "{论证, 后, 金} ({argues, after, Jin})" and another is "{论证, 后金} ({argues, Post-Jin})", which
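The coarse-to-fine intuition above can be sketched in code. The following is a minimal, hypothetical illustration (not the paper's actual BED model): stage one splits the sentence into blocks at high-confidence boundaries (punctuation), and stage two applies a finer-grained segmenter inside each block. Here the fine-grained stage is a simple forward maximum-matching pass over an illustrative lexicon, standing in for the learned decoder.

```python
import re

# Stage-1 boundaries assumed to be "easy" cues: full-width Chinese punctuation.
CONFIDENT_BOUNDARIES = "，。、；：！？"

def coarse_split(sentence):
    """Split a sentence into blocks at confident boundaries (punctuation)."""
    pattern = f"([{CONFIDENT_BOUNDARIES}])"
    # re.split with a capturing group keeps the punctuation as its own block.
    return [part for part in re.split(pattern, sentence) if part]

def fine_segment(block, lexicon):
    """Greedy forward maximum matching inside one block.

    This is a stand-in for the real fine-grained segmenter; the lexicon
    is purely illustrative.
    """
    words, i = [], 0
    max_len = max((len(w) for w in lexicon), default=1)
    while i < len(block):
        # Try the longest lexicon match first; fall back to one character.
        for length in range(min(max_len, len(block) - i), 0, -1):
            candidate = block[i:i + length]
            if length == 1 or candidate in lexicon:
                words.append(candidate)
                i += length
                break
    return words

def segment(sentence, lexicon):
    """Two-pass segmentation: coarse blocks first, then fine-grained words."""
    words = []
    for block in coarse_split(sentence):
        if block in CONFIDENT_BOUNDARIES:
            words.append(block)  # punctuation is already a complete word
        else:
            words.extend(fine_segment(block, lexicon))
    return words
```

With a lexicon containing "后金 (Post-Jin)", maximum matching resolves the ambiguous span "论证后金" to {论证, 后金}, mirroring the second reading in the example; the real model must of course learn such decisions rather than rely on a fixed lexicon.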

