SEGMENTING NATURAL LANGUAGE SENTENCES VIA LEXICAL UNIT ANALYSIS

Abstract

In this work, we present Lexical Unit Analysis (LUA), a framework for general sequence segmentation tasks. Given a natural language sentence, LUA scores all the valid segmentation candidates and utilizes dynamic programming (DP) to extract the maximum scoring one. LUA enjoys a number of appealing properties, such as inherently guaranteeing that the predicted segmentation is valid and facilitating globally optimal training and inference. Moreover, the practical time complexity of LUA can be reduced to linear, making it highly efficient. We have conducted extensive experiments on 5 tasks, including syntactic chunking, named entity recognition (NER), slot filling, Chinese word segmentation, and Chinese part-of-speech (POS) tagging, across 15 datasets. Our models have achieved state-of-the-art performance on 13 of them. The results also show that the F1 score of identifying long-length segments is notably improved.

1. INTRODUCTION

Sequence segmentation is essentially the process of partitioning a sequence of fine-grained lexical units into a sequence of coarse-grained ones. In some scenarios, each composed unit is assigned a categorical label. For example, Chinese word segmentation splits a character sequence into a word sequence (Xue, 2003). Syntactic chunking segments a word sequence into a sequence of labeled groups of words (i.e., constituents) (Sang & Buchholz, 2000). There are currently two mainstream approaches to sequence segmentation. The most common is to regard it as a sequence labeling problem using the IOB tagging scheme (Mesnil et al., 2014; Ma & Hovy, 2016; Liu et al., 2019b; Chen et al., 2019a; Luo et al., 2020). A representative work is Bidirectional LSTM-CRF (Huang et al., 2015), which adopts an LSTM (Hochreiter & Schmidhuber, 1997) to read an input sentence and a CRF (Lafferty et al., 2001) to decode the label sequence. This type of method is very effective and has produced many state-of-the-art results. However, it is vulnerable to producing invalid label sequences, for instance, "O, I-tag, I-tag". This problem is especially severe in low-resource settings (Peng et al., 2017). In our experiments (see Section 4.6), we also find that it performs poorly in recognizing long-length segments. Recently, there has been growing interest in span-based models (Zhai et al., 2017; Li et al., 2019; Yu et al., 2020). They treat a span, rather than a token, as the basic unit for labeling. Li et al. (2019) cast named entity recognition (NER) as a machine reading comprehension (MRC) task, where entities are extracted by retrieving answer spans. Yu et al. (2020) rank all the spans in terms of the scores predicted by a biaffine model (Dozat & Manning, 2016). In NER, span-based models have significantly outperformed their sequence labeling based counterparts. While these methods circumvent the IOB tagging scheme, they still rely on post-processing rules to guarantee that the extracted span set is valid.
Moreover, since span-based models are locally normalized at span level, they potentially suffer from the label bias problem (Lafferty et al., 2001). This paper seeks to provide a new framework that infers the segmentation of a unit sequence by directly selecting from all valid segmentation candidates, instead of manipulating tokens or spans. To this end, we propose Lexical Unit Analysis (LUA). LUA assigns a score to every valid segmentation candidate and leverages dynamic programming (DP) (Bellman, 1966) to search for the maximum scoring one. The score of a segmentation is computed from the scores of all its segments, and we adopt neural networks to score every segment of the input sentence. DP resolves the intractability of extracting the maximum scoring segmentation candidate by brute-force search. The time complexity of LUA is quadratic, yet it can be reduced to linear in practice by performing parallel matrix computations. As the training criterion, we incur a hinge loss between the ground truth and the predictions. We also extend LUA to unlabeled segmentation and to capturing label correlations. Figure 1 illustrates the comparison between previous methods and the proposed LUA. Prior models at token level and span level are vulnerable to generating invalid predictions and hence rely on heuristic rules to fix them. For example, in the middle part of Figure 1, the spans of two inferred named entities, [Word Cup] MISC and [Cup] MISC, conflict, which is mitigated by comparing the predicted scores. LUA scores all possible segmentation candidates and uses DP to extract the maximum scoring one. In this way, our models guarantee that the predictions are valid. Moreover, the globality of DP addresses the label bias problem. Extensive experiments are conducted on syntactic chunking, NER, slot filling, Chinese word segmentation, and Chinese part-of-speech (POS) tagging across 15 datasets.
We have obtained new state-of-the-art results on 13 of them and performed competitively on the others. In particular, we observe that LUA excels at identifying long-length segments.

2. METHODOLOGY

We denote an input sequence (i.e., fine-grained lexical units) as x = [x_1, x_2, ..., x_n], where n is the sequence length. An output sequence (i.e., coarse-grained lexical units) is represented as the segmentation y = [y_1, y_2, ..., y_m], where m is its length and each segment y_k is a triple (i_k, j_k, t_k). The pair (i_k, j_k) specifies a span that corresponds to the phrase x_{i_k,j_k} = [x_{i_k}, x_{i_k+1}, ..., x_{j_k}], and t_k is its label. A start-of-sentence symbol [SOS] is added in the pre-processing stage.
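To make the notation concrete, the following sketch uses a hypothetical toy sentence to represent a segmentation as a list of (i, j, t) triples, together with a helper (our own, for illustration) that checks the validity constraint: segments must be contiguous, non-overlapping, and cover the whole sequence.

```python
# Toy example: the fine-grained units x and one valid labeled
# segmentation y of 1-indexed (i, j, t) triples covering x exactly.
x = ["[SOS]", "Word", "Cup", "2022"]
y = [(1, 1, "O"), (2, 3, "MISC"), (4, 4, "O")]

def is_valid_segmentation(x, y):
    """True iff the spans in y are contiguous, non-overlapping,
    and cover positions 1..n of x."""
    expected_start = 1
    for (i, j, t) in y:
        if i != expected_start or j < i or j > len(x):
            return False
        expected_start = j + 1
    return expected_start == len(x) + 1
```

The universal set Y of the next subsection is exactly the set of all y for which this check succeeds, one labeling per choice of t for each span.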

2.1. MODEL: SCORING SEGMENTATION CANDIDATES

We denote Y as the universal set that contains all valid segmentation candidates. Given one of its members y ∈ Y, we compute the score f(y) as

f(y) = Σ_{(i,j,t) ∈ y} (s^c_{i,j} + s^l_{i,j,t}),    (1)

where s^c_{i,j} is the composition score, which estimates the feasibility of merging the fine-grained units [x_i, x_{i+1}, ..., x_j] into one coarse-grained unit, and s^l_{i,j,t} is the label score, which measures how likely the label of this segment is t. Both scores are produced by a scoring model.

Algorithm 1: Inference via Dynamic Programming (DP)
Input: composition score s^c_{i,j} and label score s^l_{i,j,t} for every possible segment (i, j, t).
Output: the maximum scoring segmentation candidate ŷ and its score f(ŷ).
1. Set two n × n matrices, s^L and b^c, for computing maximum scoring labels.
2. Set two n-length vectors, g and b^g, for computing the maximum scoring segmentation.
3. for 1 ≤ i ≤ j ≤ n do
4.   Compute the maximum label score for each span (i, j): s^L_{i,j} = max_{t ∈ L} s^l_{i,j,t}.
5.   Record the backtracking index: b^c_{i,j} = argmax_{t ∈ L} s^l_{i,j,t}.
6. Initialize the value of the base case x_{1,1}: g_1 = s^c_{1,1} + s^L_{1,1}.
7. for i ∈ [2, 3, ..., n] do
8.   Compute the value for the prefix x_{1,i}: g_i = max_{1 ≤ j ≤ i-1} (g_{i-j} + s^c_{i-j+1,i} + s^L_{i-j+1,i}).
9.   Record the backtracking index: b^g_i = argmax_{1 ≤ j ≤ i-1} (g_{i-j} + s^c_{i-j+1,i} + s^L_{i-j+1,i}).
10. Get the maximum scoring candidate ŷ by backtracing the tables b^g and b^c.
11. Get the maximum segmentation score: f(ŷ) = g_n.

Scoring Model. A scoring model scores all possible segments (i, j, t) for an input sentence x. Firstly, we obtain the representation for each fine-grained unit. Following prior works (Li et al., 2019; Luo et al., 2020; Yu et al., 2020), we adopt BERT (Devlin et al., 2018), a powerful pre-trained language model, as the sentence encoder.
Specifically, we have [h^w_1, h^w_2, ..., h^w_n] = BERT(x). Then, we compute the representation of a coarse-grained unit x_{i,j}, 1 ≤ i ≤ j ≤ n, as

h^p_{i,j} = h^w_i ⊕ h^w_j ⊕ (h^w_i − h^w_j) ⊕ (h^w_i ⊙ h^w_j),

where ⊕ is vector concatenation and ⊙ is element-wise product. Eventually, we employ two non-linear feedforward networks to score a segment (i, j, t):

s^c_{i,j} = (v^c)^T tanh(W^c h^p_{i,j}),    s^l_{i,j,t} = (v^l_t)^T tanh(W^l h^p_{i,j}),

where v^c, W^c, v^l_t (t ∈ L), and W^l are all learnable parameters. Besides, the scoring model used here can be flexibly replaced by any regression method.
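As an illustration, the span representation and the two scoring heads can be sketched as below. The sizes, the random weights, and the stand-in matrix for the BERT outputs are all toy assumptions for the sketch, not the paper's actual configuration.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, num_labels = 4, 8, 3           # toy sizes (assumptions)
H = rng.normal(size=(n, d))          # stand-in for BERT outputs h^w_1..h^w_n

def span_repr(H, i, j):
    """h^p_{i,j} = h^w_i ⊕ h^w_j ⊕ (h^w_i - h^w_j) ⊕ (h^w_i ⊙ h^w_j),
    with 1-indexed i <= j. Output dimension is 4d."""
    hi, hj = H[i - 1], H[j - 1]
    return np.concatenate([hi, hj, hi - hj, hi * hj])

# Two feedforward scoring heads, randomly initialized for illustration.
W_c = rng.normal(size=(d, 4 * d)); v_c = rng.normal(size=d)
W_l = rng.normal(size=(d, 4 * d)); V_l = rng.normal(size=(num_labels, d))

def composition_score(H, i, j):
    """s^c_{i,j}: a scalar feasibility score for merging units i..j."""
    return float(v_c @ np.tanh(W_c @ span_repr(H, i, j)))

def label_scores(H, i, j):
    """s^l_{i,j,t} for every label t, stacking the vectors v^l_t as rows."""
    return V_l @ np.tanh(W_l @ span_repr(H, i, j))
```

In the actual model these heads are trained end-to-end with the encoder; here they only demonstrate the shapes involved.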

2.2. INFERENCE VIA DYNAMIC PROGRAMMING

The prediction of the maximum scoring segmentation candidate can be formulated as

ŷ = argmax_{y ∈ Y} f(y).    (5)

Because the size of the search space |Y| grows exponentially with the sequence length n, solving Equation 5 by brute-force search is computationally infeasible. LUA instead uses DP, which is facilitated by the decomposable nature of Equation 1. DP is a well-known optimization method that solves a complicated problem by breaking it down into simpler sub-problems in a recursive manner. The relation between the value of the larger problem and the values of its sub-problems is called the Bellman equation.
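To see why brute force is infeasible: a length-n sequence admits 2^(n-1) unlabeled segmentations (each of the n-1 internal boundaries is either a split point or not), and labeling multiplies the count further. The hypothetical helper below enumerates them, and is exponential on purpose.

```python
def all_segmentations(n):
    """Enumerate every valid unlabeled segmentation of a length-n
    sequence as a list of 1-indexed (i, j) spans. |Y| = 2^(n-1)."""
    if n == 0:
        return [[]]
    out = []
    for j in range(1, n + 1):        # j = length of the last segment
        for head in all_segmentations(n - j):
            out.append(head + [(n - j + 1, n)])
    return out
```

Even at n = 50 this set has roughly 5.6 × 10^14 members, which is exactly the intractability the DP below avoids.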

Sub-problem.

In the context of LUA, the sub-problem of segmenting an input unit sequence x is segmenting its prefixes x_{1,i}, 1 ≤ i ≤ n. We define g_i as the maximum segmentation score of the prefix x_{1,i}. Under this scheme, we have max_{y ∈ Y} f(y) = g_n.

The Bellman Equation. The relationship between segmenting a sequence x_{1,i}, i > 1, and segmenting its prefixes x_{1,i-j}, 1 ≤ j ≤ i − 1, is built through the last segment (i − j + 1, i, t):

g_i = max_{1 ≤ j ≤ i-1} (g_{i-j} + s^c_{i-j+1,i} + max_{t ∈ L} s^l_{i-j+1,i,t}).    (6)

In practice, to reduce the time complexity of the above equation, the last term is computed beforehand as s^L_{i,j} = max_{t ∈ L} s^l_{i,j,t}, 1 ≤ i ≤ j ≤ n. Hence, Equation 6 is reformulated as

g_i = max_{1 ≤ j ≤ i-1} (g_{i-j} + s^c_{i-j+1,i} + s^L_{i-j+1,i}).    (7)

The base case is the first token x_{1,1} = [[SOS]]; its score g_1 is s^c_{1,1} + s^L_{1,1}. Algorithm 1 shows how DP is applied in inference. Firstly, we set two matrices and two vectors to store the solutions to the sub-problems (1-st to 2-nd lines). Secondly, we get the maximum label scores for all the spans (3-rd to 5-th lines). Then, we initialize the trivial case g_1 and recursively calculate the values for the prefixes x_{1,i}, i > 1 (6-th to 9-th lines). Finally, we get the predicted segmentation ŷ and its score f(ŷ) (10-th to 11-th lines). The time complexity of Algorithm 1 is O(n^2). By performing the max operation of Equation 7 in parallel on GPU, it can be reduced to O(n), which is highly efficient. Besides, DP, the backbone of the proposed model, is non-parametric; the trainable parameters exist only in the scoring model. This shows that LUA is a very lightweight algorithm.
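A minimal sequential Python sketch of Algorithm 1 follows, assuming the scores are given as plain dictionaries keyed by 1-indexed spans; the function name and data layout are our own illustrative choices, and the batched O(n) GPU variant the paper describes is not shown.

```python
def lua_decode(s_c, s_l, n, labels):
    """Maximum-scoring segmentation via DP (Algorithm 1).
    s_c[(i, j)]: composition score; s_l[(i, j, t)]: label score,
    for 1-indexed spans 1 <= i <= j <= n. Returns (y_hat, f(y_hat))."""
    # Lines 3-5: precompute s^L_{i,j} and the argmax labels b^c_{i,j}.
    sL, bc = {}, {}
    for i in range(1, n + 1):
        for j in range(i, n + 1):
            bc[(i, j)] = max(labels, key=lambda t: s_l[(i, j, t)])
            sL[(i, j)] = s_l[(i, j, bc[(i, j)])]
    # Lines 6-9: g[i] = best score for prefix x_{1,i}; bg[i] = backpointer
    # storing the length j of the best last segment (i-j+1, i).
    g, bg = {1: s_c[(1, 1)] + sL[(1, 1)]}, {}
    for i in range(2, n + 1):
        best = max(range(1, i),
                   key=lambda j: g[i - j] + s_c[(i - j + 1, i)] + sL[(i - j + 1, i)])
        bg[i] = best
        g[i] = g[i - best] + s_c[(i - best + 1, i)] + sL[(i - best + 1, i)]
    # Lines 10-11: backtrace bg and bc to recover the segmentation.
    y, i = [], n
    while i > 1:
        j = bg[i]
        y.append((i - j + 1, i, bc[(i - j + 1, i)]))
        i -= j
    y.append((1, 1, bc[(1, 1)]))
    return list(reversed(y)), g[n]
```

The two nested loops make the quadratic complexity explicit; in the paper's implementation the inner max of Equation 7 is a single parallel reduction per i.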

2.3. TRAINING CRITERION

We adopt a max-margin penalty as the loss function for training. Given the predicted segmentation ŷ and the ground-truth segmentation y*, we have

J = max(0, 1 − f(y*) + f(ŷ)).
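In code the loss is a one-liner; note that ŷ is the segmentation decoded by Algorithm 1 at each training step, so this is a structured hinge loss that is zero once the gold segmentation outscores the best competitor by a margin of 1.

```python
def lua_hinge_loss(score_gold, score_pred):
    """J = max(0, 1 - f(y*) + f(y_hat)), with scores from Equation 1."""
    return max(0.0, 1.0 - score_gold + score_pred)
```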

3. EXTENSIONS OF LUA

We propose two extensions that generalize LUA to different scenarios.

Unlabeled Segmentation. In some tasks (e.g., Chinese word segmentation), the segments are unlabeled. Under this scheme, Equation 1 and Equation 7 are reformulated as

f(y) = Σ_{(i,j) ∈ y} s^c_{i,j},    g_i = max_{1 ≤ j ≤ i-1} (g_{i-j} + s^c_{i-j+1,i}).

Capturing Label Correlations. In some tasks (e.g., syntactic chunking), the labels of segments are strongly correlated. To incorporate this information, we redefine f(y) as

f(y) = Σ_{1 ≤ k ≤ m} (s^c_{i_k,j_k} + s^l_{i_k,j_k,t_k}) + Σ_{1 ≤ k ≤ m} s^d_{t_{k-q+1}, t_{k-q+2}, ..., t_k}.    (8)

The score s^d_{t_{k-q+1}, ..., t_k} models the label dependencies among q successive segments, y_{k-q+1,k}. In practice, we find that q = 2 balances efficiency and effectiveness well, and thus parameterize a learnable matrix W^d ∈ R^{|L| × |L|} to implement it. The Bellman equation corresponding to the above scoring function is

g_{i,t} = max_{1 ≤ j ≤ i-1} max_{t' ∈ L} (g_{i-j,t'} + s^d_{t',t} + s^c_{i-j+1,i} + s^l_{i-j+1,i,t}),

where g_{i,t} is the maximum score of segmenting the prefix x_{1,i} with its last segment labeled t. For initialization, we set the value of g_{1,O} to 0 and the others to −∞. By performing the inner two max operations in parallel, the practical time complexity of computing g_{i,t}, 1 ≤ i ≤ n, t ∈ L is also O(n). Ultimately, the segmentation score f(ŷ) is obtained as max_{t ∈ L} g_{n,t}. This extension further improves the results on syntactic chunking and Chinese POS tagging, as both tasks have rich sequential features among the labels of segments.
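A score-only sketch of the q = 2 Bellman recursion above, assuming the bigram transition scores s^d are given as a dictionary and using the initialization g_{1,O} = 0 described in the text; backtracking is omitted and the sequential loops stand in for the paper's parallel implementation.

```python
import math

def lua_decode_bigram_score(s_c, s_l, s_d, n, labels):
    """Maximum segmentation score under the label-correlation extension.
    g[(i, t)]: best score for prefix x_{1,i} with its last segment labeled t.
    s_d[(t_prev, t)]: bigram label-transition score (the W^d entries).
    Assumes the [SOS] base segment (1, 1) carries label 'O'."""
    g = {(1, t): -math.inf for t in labels}
    g[(1, "O")] = 0.0
    for i in range(2, n + 1):
        for t in labels:
            g[(i, t)] = max(
                g[(i - j, tp)] + s_d[(tp, t)]
                + s_c[(i - j + 1, i)] + s_l[(i - j + 1, i, t)]
                for j in range(1, i)      # length of the last segment
                for tp in labels          # label of the previous segment
            )
    return max(g[(n, t)] for t in labels)
```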

4. EXPERIMENTS

We have conducted extensive studies on 5 tasks, including Chinese word segmentation, Chinese POS tagging, syntactic chunking, NER, and slot filling, across 15 datasets. First, our models have achieved new state-of-the-art performance on 13 of them. Second, the results demonstrate that the F1 score of identifying long-length segments is notably improved. Last, we show that LUA is very efficient in terms of running time.

4.1. SETTINGS

We use the same configurations for all 15 datasets. The L2 regularization weight and the dropout ratio are set to 1 × 10^-6 and 0.2, respectively, to reduce overfitting. We use Adam (Kingma & Ba, 2014) to optimize our model. Following prior works, BERT-Base is adopted as the sentence encoder: uncased BERT-Base for slot filling, Chinese BERT-Base for the Chinese tasks (e.g., Chinese POS tagging), and cased BERT-Base for the others (e.g., syntactic chunking). In addition, the improvements of our model over the baselines are statistically significant with p < 0.05 under a t-test.

4.2. CHINESE WORD SEGMENTATION

Chinese word segmentation splits a Chinese character sequence into a sequence of Chinese words. We use the SIGHAN 2005 Bakeoff (Emerson, 2005) and Chinese Treebank 6.0 (CTB6) (Xue et al., 2005) as benchmarks. Note that some strong baselines leverage external resources, such as glyph information (Meng et al., 2019) or POS tags (Yang et al., 2017). Despite this, our model is still competitive with Glyce + BERT on MSR.

4.3. CHINESE POS TAGGING

Chinese POS tagging jointly segments a Chinese character sequence and assigns a POS tag to each segmented unit. We use Chinese Treebank 5.0 (CTB5), CTB6, Chinese Treebank 9.0 (CTB9) (Xue et al., 2005), and the Chinese section of Universal Dependencies 1.4 (UD1) (Nivre et al., 2016). CTB5 is comprised of newswire data. CTB9 consists of source texts in various genres, which cover CTB5. We convert the texts in UD1 from traditional Chinese into simplified Chinese. We follow the same train/dev/test split for the above datasets as in Shao et al. (2017).

Table 3: Experiment results on syntactic chunking and NER.

Model | CoNLL-2000 | CoNLL-2003 | OntoNotes 5.0
Bi-LSTM + CRF (Huang et al., 2015) | 94.46 | 90.10 | -
Flair Embeddings (Akbik et al., 2018) | 96.72 | 93.09 | 89.3
GCDT w/ BERT (Liu et al., 2019b) | 96.81 | 93.23 | -
BERT-MRC (Li et al., 2019) | - | 93.04 | 91.11
HCR w/ BERT (Luo et al., 2020) | - | 93.37 | 90.30
BERT-Biaffine Model (Yu et al., 2020) | … | … | …

Table 2 shows the experiment results. The performances of all baselines are reported from Meng et al. (2019). Our model, LUA w/ Label Correlations, has yielded new state-of-the-art results on all the datasets: it improves the F1 scores by 1.35% on CTB5, 1.22% on CTB6, 0.8% on CTB9, and 0.94% on UD1. Moreover, the basic LUA without capturing label correlations also outperforms the strongest baseline, Glyce + BERT, by 0.18% on CTB5 and 0.07% on CTB9. All these facts further verify the effectiveness of LUA and its extension.

4.4. SYNTACTIC CHUNKING AND NER

Syntactic chunking aims to find phrases related to syntactic categories in a sentence; for NER, entities are annotated with types (Person, Percent, etc.). We follow the same format and partition as in Li et al. (2019); Luo et al. (2020); Yu et al. (2020). In order to compare fairly with previously reported results, we convert the predicted segments into IOB format and utilize the conlleval script to compute the F1 score at test time. Table 3 shows

4.5. SLOT FILLING

Slot filling, an important task in spoken language understanding (SLU), extracts semantic constituents from an utterance. We use the ATIS dataset (Hemphill et al., 1990), the SNIPS dataset (Coucke et al., 2018), and the MTOD dataset (Schuster et al., 2018).

Table 4: Experiment results on the three datasets of slot filling.

Model | ATIS | SNIPS | MTOD
Slot-Gated SLU (Goo et al., 2018) | 95.20 | 88.30 | 95.12
Bi-LSTM + ELMo (Siddhant et al., 2019) | 95.42 | 93.90 | -
Joint BERT (Chen et al., 2019b) | 96.10 | 97.00 | 96.48
CM-Net (Liu et al., 2019c) | 96… | … | …

Table 4 summarizes the experiment results for slot filling. On ATIS and SNIPS, we take the results of all baselines as reported in Liu et al. (2019c) for comparison. On MTOD, we rerun the open-source toolkits Slot-Gated SLU and Joint BERT. As all previous approaches jointly model slot filling and intent detection (a classification task in SLU), we follow them to augment LUA with intent detection for a fair comparison. As shown in Table 4, the augmented LUA has surpassed all baselines and obtained state-of-the-art results on the three datasets: it increases the F1 scores by around 0.05% on ATIS and SNIPS, and delivers a substantial gain of 1.11% on MTOD. It is worth mentioning that, even without modeling intent detection, LUA outperforms the strong baseline Joint BERT by margins of 0.18% and 0.21% on ATIS and SNIPS.

4.6. LONG-LENGTH SEGMENT IDENTIFICATION

Since LUA does not resort to the IOB tagging scheme, it should be more accurate at recognizing long-length segments than prior methods. To verify this intuition, we evaluate different models on segments of different lengths. This study is conducted on the OntoNotes 5.0 dataset. Two strong models are adopted as baselines: the best sequence labeling model (i.e., HCR) and the best span-based model (i.e., BERT-Biaffine Model). Both baselines are reproduced by rerunning their open-source code, biaffine-ner and Hire-NER. The results are shown in Table 5. On the one hand, both LUA and the Biaffine Model obtain much higher scores for extracting long-length entities than HCR. For example, LUA outperforms HCR w/ BERT by almost twofold on the range 12-24. On the other hand, LUA achieves even better results than BERT-Biaffine Model. For instance, the F1 score improvements of LUA over it are 10.11% on the range 8-11 and 41.23% on the range 12-24. The success of our models in performance does not come at a serious cost in efficiency. For example, with the same practical time complexity, BERT + CRF is slower than the proposed LUA by 15.01% and than LUA w/ Label Correlations by 5.30%.

5. RELATED WORK

Sequence segmentation aims to partition a fine-grained unit sequence into multiple labeled coarse-grained units. Traditionally, there are two types of methods. The most common is to cast it into a sequence labeling task (Mesnil et al., 2014; Ma & Hovy, 2016; Chen et al., 2019a). Semi-Markov CRF (Sarawagi & Cohen, 2005) improves CRF at phrase level. However, the computation of the CRF loss is costly in practice, and the potential to model label dependencies among segments is limited. An alternative, less studied approach uses a transition-based system to incrementally segment and label an input sequence (Zhang et al., 2016; Lample et al., 2016). For instance, Qian et al. (2015) present a transition-based model for joint word segmentation, POS tagging, and text normalization. Wang et al. (2017) apply a transition-based model to the disfluency detection task, which helps capture non-local chunk-level features. These models have many advantages, such as theoretically lower time complexity and labeling the extracted mentions at span level. However, to the best of our knowledge, no recent transition-based models surpass their sequence labeling based counterparts. More recently, there has been a surge of interest in span-based models. They treat a segment, instead of a fine-grained token, as the basic unit for labeling. For example, Li et al. (2019) regard NER as an MRC task, where entities are recognized by retrieving answer spans. Since these methods are locally normalized at span level rather than sequence level, they potentially suffer from the label bias problem. Additionally, they rely on rules to ensure that the extracted span set is valid. Span-based methods also emerge in other fields of NLP. In dependency parsing, Wang & Chang (2016) propose an LSTM-based sentence segment embedding method named LSTM-Minus. Stern et al. (2017) integrate the LSTM-Minus feature into constituent parsing models. In coreference resolution, Lee et al.
(2018) consider all spans in a document as potential mentions and learn distributions over the possible antecedents for each.

6. CONCLUSION

This work proposes LUA, a novel framework for general sequence segmentation tasks. LUA directly scores all the valid segmentation candidates and uses dynamic programming to extract the maximum scoring one. Compared with previous models, LUA naturally guarantees that the predicted segmentation is valid and circumvents the label bias problem. Extensive studies are conducted on 5 tasks across 15 datasets. We have achieved state-of-the-art performance on 13 of them. Importantly, the F1 score of identifying long-length segments is significantly improved.



Footnote URLs:
- conlleval script: https://www.clips.uantwerpen.be/conll2000/chunking/conlleval.txt
- GCDT: https://github.com/Adaxry/GCDT
- Slot-Gated SLU: https://github.com/MiuLab/SlotGated-SLU
- Joint BERT: https://github.com/monologg/JointBERT
- biaffine-ner: https://github.com/juntaoy/biaffine-ner
- Hire-NER: https://github.com/cslydia/Hire-NER



Figure 1: A toy example to show LUA and how it differs from prior methods. The items in blue and red respectively denote valid and invalid predictions.

the results. Most of the baselines are directly taken from Akbik et al. (2018); Li et al. (2019); Luo et al. (2020); Yu et al. (2020). Besides, following Luo et al. (2020), we rerun the source code of GCDT and report its result on CoNLL-2000 with the standard evaluation method. Generally, our proposed model, LUA w/o Label Correlations, yields competitive performance against state-of-the-art models on both the chunking and NER tasks. Specifically, regarding the NER task, on the CoNLL-2003 dataset our model LUA outperforms several strong baselines, including Flair Embeddings, and it is comparable to the state-of-the-art model (i.e., BERT-Biaffine Model). In particular, on the OntoNotes dataset, LUA outperforms it by 0.79% and establishes a new state-of-the-art result. Regarding the chunking task, LUA advances the best model (GCDT), and the improvements are further enlarged to 0.42% by LUA w/ Label Correlations.

Experiment results on Chinese word segmentation.

Experiment results on the four datasets of Chinese POS tagging.

The SIGHAN 2005 Bakeoff consists of four datasets, namely AS, MSR, CITYU, and PKU. Following Ma et al. (2018), we randomly select 10% of the training data as the development set. We convert all digits, punctuation, and Latin letters to half-width to handle the full/half-width mismatch between the training and test sets. We also convert AS and CITYU to simplified Chinese. For CTB6, we follow the same format and partition as in Yang et al. (2017); Ma et al. (2018).

Experiment results on syntactic chunking and NER.

Experiment results on the three datasets of slot filling.

The F1 scores for NER models on different segment lengths. A - B (N) denotes that there are N entities whose span lengths are between A and B.

The ATIS dataset consists of audio recordings of people making flight reservations. Its training set contains 4478 utterances and its test set contains 893 utterances. The SNIPS dataset is collected by the Snips personal voice assistant; its training set contains 13084 utterances and its test set contains 700 utterances. The MTOD dataset has three domains: Alarm, Reminder, and Weather. We use the English part of the MTOD dataset, where the training, dev, and test sets respectively contain 30521, 4181, and 8621 utterances. We follow the same partition of the above datasets as in Goo et al. (2018); Schuster et al. (2018).

shows the running time comparison among different models. The middle two columns give the time complexity of decoding a label sequence. The last column gives the time cost of one training epoch. We set the batch size to 16 and run all the models on 1 GPU. The results indicate that

Running time comparison on the syntactic chunking dataset.

by using the IOB tagging scheme. This method is simple and effective, providing a number of state-of-the-art results. Akbik et al. (2018) present Flair Embeddings, which pretrain character embeddings on a large corpus and directly use them, instead of word representations, to encode a sentence. Liu et al. (2019b) introduce GCDT, which deepens the state transition path at each position in a sentence and further assigns each word a global representation. Luo et al. (2020) use hierarchical contextualized representations to incorporate both sentence-level and document-level information. Nevertheless, these models are vulnerable to producing invalid labels and perform poorly in identifying long-length segments. This problem is very severe in low-resource settings. Ye & Ling (2018) and Liu et al. (2019a) adopt Semi-Markov CRF.

