SEGMENTING NATURAL LANGUAGE SENTENCES VIA LEXICAL UNIT ANALYSIS

Abstract

In this work, we present Lexical Unit Analysis (LUA), a framework for general sequence segmentation tasks. Given a natural language sentence, LUA scores all the valid segmentation candidates and utilizes dynamic programming (DP) to extract the maximum scoring one. LUA enjoys a number of appealing properties, such as inherently guaranteeing that the predicted segmentation is valid and facilitating globally optimal training and inference. Besides, the practical time complexity of LUA can be reduced to linear time, which is highly efficient. We have conducted extensive experiments on 5 tasks, including syntactic chunking, named entity recognition (NER), slot filling, Chinese word segmentation, and Chinese part-of-speech (POS) tagging, across 15 datasets. Our models achieve state-of-the-art performance on 13 of them. The results also show that the F1 score on long segments is notably improved.

1. INTRODUCTION

Sequence segmentation is essentially the process of partitioning a sequence of fine-grained lexical units into a sequence of coarse-grained ones. In some scenarios, each composed unit is assigned a categorical label. For example, Chinese word segmentation splits a character sequence into a word sequence (Xue, 2003). Syntactic chunking segments a word sequence into a sequence of labeled groups of words (i.e., constituents) (Sang & Buchholz, 2000).

There are currently two mainstream approaches to sequence segmentation. The most common is to regard it as a sequence labeling problem using the IOB tagging scheme (Mesnil et al., 2014; Ma & Hovy, 2016; Liu et al., 2019b; Chen et al., 2019a; Luo et al., 2020). A representative work is Bidirectional LSTM-CRF (Huang et al., 2015), which adopts an LSTM (Hochreiter & Schmidhuber, 1997) to read an input sentence and a CRF (Lafferty et al., 2001) to decode the label sequence. This type of method is very effective and has produced many state-of-the-art results. However, it is vulnerable to producing invalid label sequences, for instance, "O, I-tag, I-tag". This problem is especially severe in low-resource settings (Peng et al., 2017). In experiments (see Section 4.6), we also find that it performs poorly at recognizing long segments.

Recently, there has been growing interest in span-based models (Zhai et al., 2017; Li et al., 2019; Yu et al., 2020). They treat a span rather than a token as the basic unit for labeling. Li et al. (2019) cast named entity recognition (NER) as a machine reading comprehension (MRC) task, where entities are extracted by retrieving answer spans. Yu et al. (2020) rank all the spans by the scores predicted by a bi-affine model (Dozat & Manning, 2016). In NER, span-based models have significantly outperformed their sequence-labeling counterparts. While these methods circumvent the IOB tagging scheme, they still rely on post-processing rules to guarantee that the extracted span set is valid.
Moreover, since span-based models are locally normalized at the span level, they potentially suffer from the label bias problem (Lafferty et al., 2001).

This paper seeks to provide a new framework which infers the segmentation of a unit sequence by directly selecting from all valid segmentation candidates, instead of manipulating tokens or spans. To this end, we propose Lexical Unit Analysis (LUA). LUA assigns a score to every valid segmentation candidate and leverages dynamic programming (DP) (Bellman, 1966) to search for the maximum scoring one. The score of a segmentation is computed from the scores of all its segments, and we adopt neural networks to score every segment of the input sentence. DP resolves the intractability of extracting the maximum scoring segmentation candidate by brute-force search. The time complexity of LUA is quadratic, yet it can be reduced to linear time in practice by performing parallel matrix computations. As the training criterion, we incur a hinge loss between the ground truth and the predictions. We also extend LUA to unlabeled segmentation and to capturing label correlations.

Figure 1 illustrates the comparison between previous methods and the proposed LUA. Prior models at the token and span levels are vulnerable to generating invalid predictions, and hence rely on heuristic rules to fix them. For example, in the middle part of Figure 1, the spans of two inferred named entities, [World Cup] MISC and [Cup] MISC, conflict, which is mitigated by comparing their predicted scores. LUA scores all possible segmentation candidates and uses DP to extract the maximum scoring one. In this way, our models guarantee the predictions to be valid. Moreover, the globality of DP addresses the label bias problem.

Extensive experiments are conducted on syntactic chunking, NER, slot filling, Chinese word segmentation, and Chinese part-of-speech (POS) tagging across 15 datasets.
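To make the DP idea concrete, the following is a minimal sketch of extracting the maximum scoring segmentation over a prefix lattice. The function name and the dense span-score matrix interface are our own assumptions for exposition (the scores would come from the neural scorer, already maximized over labels), not the paper's implementation:

```python
def max_scoring_segmentation(score, n):
    """Sketch of max-scoring segmentation via DP.

    score[i][j]: best score of treating tokens x[i..j] (0-indexed,
    inclusive) as a single segment, already maximized over labels.
    """
    NEG_INF = float("-inf")
    best = [NEG_INF] * (n + 1)  # best[j]: max score of segmenting x[0..j-1]
    best[0] = 0.0
    back = [0] * (n + 1)        # back[j]: start index of the last segment
    for j in range(1, n + 1):
        for i in range(j):      # the last segment is x[i..j-1]
            cand = best[i] + score[i][j - 1]
            if cand > best[j]:
                best[j], back[j] = cand, i
    segments, j = [], n         # recover segments via backpointers
    while j > 0:
        i = back[j]
        segments.append((i, j - 1))
        j = i
    return best[n], segments[::-1]
```

Because every recovered chain of segments tiles the sentence exactly, the output is valid by construction, which is the property the text emphasizes; the quadratic loop over (i, j) is what parallel matrix computations reduce to linear sequential steps in practice.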
We have obtained new state-of-the-art results on 13 of them and performed competitively on the others. In particular, we observe that LUA excels at identifying long segments.

2. METHODOLOGY

We denote an input sequence (i.e., fine-grained lexical units) as x = [x_1, x_2, ..., x_n], where n is the sequence length. An output sequence (i.e., coarse-grained lexical units) is represented as the segmentation y = [y_1, y_2, ..., y_m], where m denotes its length and each segment y_k is a triple (i_k, j_k, t_k). Here (i_k, j_k) specifies a span that corresponds to the phrase x_{i_k,j_k} = [x_{i_k}, x_{i_k+1}, ..., x_{j_k}]. A start-of-sentence symbol [SOS] is added in the pre-processing stage.

2.1. MODEL: SCORING SEGMENTATION CANDIDATES

We denote Y as the universal set that contains all valid segmentation candidates. Given one of its members y ∈ Y, we compute the score f(y) as

f(y) = Σ_{(i,j,t) ∈ y} (s^c_{i,j} + s^l_{i,j,t}),

where s^c_{i,j} is the composition score of treating the span (i, j) as one segment and s^l_{i,j,t} is the score of assigning label t to that span.
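In code, the score above is simply a sum over the segments of y. The dictionary-based lookups below are an assumed interface for illustration; in the model these scores come from neural networks:

```python
def segmentation_score(y, s_c, s_l):
    # y: list of (i, j, t) triples.
    # s_c maps spans (i, j) to composition scores;
    # s_l maps labeled spans (i, j, t) to label scores.
    return sum(s_c[(i, j)] + s_l[(i, j, t)] for (i, j, t) in y)
```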



Figure 1: A toy example to show LUA and how it differs from prior methods. The items in blue and red respectively denote valid and invalid predictions.

The label t_k of each segment is drawn from the label space L. We define a segmentation candidate as valid if its segments are non-overlapping and fully cover the input sequence. Below is a case extracted from the CoNLL-2003 dataset (Sang & De Meulder, 2003):

x = [[SOS], Sangthai, Glory, 22/11/96, 3000, Singapore]
y = [(1, 1, O), (2, 3, MISC), (4, 4, O), (5, 5, O), (6, 6, LOC)]
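The validity condition can be checked mechanically. This small helper is our own, for illustration only: it accepts exactly those candidates whose sorted spans tile positions 1..n with no gaps or overlaps:

```python
def is_valid_segmentation(y, n):
    # Valid iff the spans are non-overlapping and jointly cover 1..n.
    spans = sorted((i, j) for (i, j, _) in y)
    expected = 1
    for i, j in spans:
        if i != expected or j < i:
            return False  # gap, overlap, or malformed span
        expected = j + 1
    return expected == n + 1
```

Applied to the example above (n = 6), the candidate is valid; removing any segment, or adding an overlapping one, makes it invalid. LUA never needs such a post-hoc check, since its DP only ever constructs candidates of this form.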

