SEGMENTING NATURAL LANGUAGE SENTENCES VIA LEXICAL UNIT ANALYSIS

Abstract

In this work, we present Lexical Unit Analysis (LUA), a framework for general sequence segmentation tasks. Given a natural language sentence, LUA scores all the valid segmentation candidates and utilizes dynamic programming (DP) to extract the maximum-scoring one. LUA enjoys a number of appealing properties, such as inherently guaranteeing that the predicted segmentation is valid and facilitating globally optimal training and inference. Moreover, the practical time complexity of LUA can be reduced to linear, making it highly efficient. We have conducted extensive experiments on 5 tasks, including syntactic chunking, named entity recognition (NER), slot filling, Chinese word segmentation, and Chinese part-of-speech (POS) tagging, across 15 datasets. Our models achieve state-of-the-art performance on 13 of them. The results also show that the F1 score for identifying long segments is notably improved.

1. INTRODUCTION

Sequence segmentation is essentially the process of partitioning a sequence of fine-grained lexical units into a sequence of coarse-grained ones. In some scenarios, each composed unit is also assigned a categorical label. For example, Chinese word segmentation splits a character sequence into a word sequence (Xue, 2003). Syntactic chunking segments a word sequence into a sequence of labeled groups of words (i.e., constituents) (Sang & Buchholz, 2000).

There are currently two mainstream approaches to sequence segmentation. The most common is to regard it as a sequence labeling problem using the IOB tagging scheme (Mesnil et al., 2014; Ma & Hovy, 2016; Liu et al., 2019b; Chen et al., 2019a; Luo et al., 2020). A representative work is Bidirectional LSTM-CRF (Huang et al., 2015), which adopts an LSTM (Hochreiter & Schmidhuber, 1997) to read an input sentence and a CRF (Lafferty et al., 2001) to decode the label sequence. This type of method is very effective and has yielded many state-of-the-art results. However, it is prone to producing invalid label sequences, for instance, "O, I-tag, I-tag". This problem is especially severe in low-resource settings (Peng et al., 2017). In our experiments (see Section 4.6), we also find that it performs poorly at recognizing long segments.

Recently, there has been growing interest in span-based models (Zhai et al., 2017; Li et al., 2019; Yu et al., 2020), which treat a span rather than a token as the basic unit for labeling. Li et al. (2019) cast named entity recognition (NER) as a machine reading comprehension (MRC) task, where entities are extracted by retrieving answer spans. Yu et al. (2020) rank all the spans by the scores predicted by a biaffine model (Dozat & Manning, 2016). In NER, span-based models have significantly outperformed their sequence-labeling counterparts. While these methods circumvent the IOB tagging scheme, they still rely on post-processing rules to guarantee that the extracted span set is valid.
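To make the validity issue concrete, the following is a minimal sketch (not from the paper; function and variable names are ours) of the kind of post-processing rule span-based models need, namely discarding overlapping spans, greedily by score:

```python
# Hypothetical post-processing for span-based extraction: keep only
# mutually non-overlapping spans, preferring higher-scoring ones.
def filter_overlapping_spans(spans):
    """spans: list of (start, end, score) tuples with end exclusive.

    Returns the kept spans sorted by position.
    """
    kept = []
    # Visit spans from highest to lowest score.
    for start, end, score in sorted(spans, key=lambda s: -s[2]):
        # Keep a span only if it does not overlap any already-kept span.
        if all(end <= ks or start >= ke for ks, ke, _ in kept):
            kept.append((start, end, score))
    return sorted(kept)
```

This kind of rule is applied after prediction, which is exactly the point criticized above: validity is not guaranteed by the model itself.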
Moreover, since span-based models are locally normalized at the span level, they potentially suffer from the label bias problem (Lafferty et al., 2001).

This paper seeks to provide a new framework that infers the segmentation of a unit sequence by directly selecting from all valid segmentation candidates, instead of manipulating tokens or spans. To this end, we propose Lexical Unit Analysis (LUA). LUA assigns a score to every valid segmentation candidate and leverages dynamic programming (DP) (Bellman, 1966) to search for the maximum-scoring one. The score of a segmentation is computed from the scores of all of its segments, and we adopt neural networks to score every segment of the input sentence.
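The DP search over segmentation candidates can be sketched as follows. This is a minimal, hypothetical illustration (names are ours; segment labels are omitted for brevity): the segment scorer is assumed to be given, e.g. by a neural network, and the recurrence simply tracks the best score of segmenting each prefix.

```python
# Sketch of maximum-scoring segmentation via dynamic programming.
# best[j] holds the best total score for segmenting the prefix [0, j);
# back[j] holds the start index of the last segment in that solution.
def max_scoring_segmentation(n, segment_score, max_len=None):
    """n: sequence length; segment_score(i, j) -> score of segment [i, j)."""
    max_len = max_len or n
    best = [float("-inf")] * (n + 1)
    best[0] = 0.0
    back = [0] * (n + 1)
    for j in range(1, n + 1):
        # Try every start position i of a final segment [i, j).
        for i in range(max(0, j - max_len), j):
            s = best[i] + segment_score(i, j)
            if s > best[j]:
                best[j], back[j] = s, i
    # Recover the segments by walking backpointers from the end.
    segments, j = [], n
    while j > 0:
        i = back[j]
        segments.append((i, j))
        j = i
    return best[n], segments[::-1]
```

Because only valid partitions of the sequence are ever enumerated by the recurrence, the output is a valid segmentation by construction, with no post-processing required.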

