FILTERED SEMI-MARKOV CRF

Abstract

Semi-Markov CRF (Sarawagi and Cohen, 2005) has been proposed as an alternative to the traditional Linear Chain CRF (Lafferty et al., 2001) for text segmentation tasks such as Named Entity Recognition. In contrast to CRF, which treats text segmentation as token-level prediction, Semi-CRF considers spans as the task's basic unit, which makes it more expressive. However, Semi-CRF has two major drawbacks: (1) it has quadratic complexity over sequence length as it operates on every span of the input sequence, and (2) empirically, it performs worse than classical CRF for sequence labeling tasks such as NER. In our work, we propose Filtered Semi-Markov CRF, a Semi-CRF variant that addresses the aforementioned issues. Our model extends Semi-CRF by incorporating a filtering step for eliminating irrelevant segments, which helps reduce the complexity and dramatically reduce the search space. On a variety of NER benchmarks, we find that our approach outperforms both CRF and Semi-CRF models while being significantly faster. We will make our code available to the public.

1. INTRODUCTION

Sequence segmentation is the process of dividing a sequence into several distinct, non-overlapping segments to cover the entire sequence (Sarawagi and Cohen, 2005; Terzi, 2006). It has a wide range of use cases, including Named Entity Recognition (Tjong Kim Sang and De Meulder, 2003) and Chinese Word Segmentation (Li and Yuan, 1998). Sequence segmentation has traditionally been seen as a sequence labeling problem using pre-existing templates such as BIO and BILOU schemes (Ratinov and Roth, 2009). Conditional Random Field (CRF) (Lafferty et al., 2001) has been widely used in sequence labeling problems to model the dependence between adjacent token tags. Although the Linear-Chain CRF has performed well in various segmentation tasks, operating at the segment level rather than the token level would be a more natural way to perform sequence segmentation. To this end, the Semi-Markov CRF (Sarawagi and Cohen, 2005) has been proposed as a variant of CRF, allowing for the incorporation of higher-level segment features, such as segment width. However, Semi-CRF, unlike CRF, is considerably slower for both learning and inference due to its quadratic complexity with respect to the sequence length. Moreover, Semi-CRF generally performs worse than CRF (sometimes the Semi-CRF performs better but the gain is only marginal) (Liang, 2005; Daumé and Marcu, 2005; Andrew, 2006). Indeed, the Semi-CRF performs joint segmentation and labeling, which results in a much larger search space and makes learning more challenging. To address this problem, we propose a filtered version of the Semi-CRF. Like Semi-CRF (Sarawagi and Cohen, 2005), our model operates on segments, but we add a filtering model to discard a large number of candidate segments. Our aim is to reduce the computational complexity by pruning the segmentation search space.
During inference and after the filtering step, finding the best segmentation and labeling boils down to finding the maximum scoring path in a weighted directed acyclic graph. During training we use a similar dynamic programming algorithm allowing us to sum over all paths in the graph. We evaluate our approach on benchmark datasets for Named Entity Recognition and find that it performs better than CRF and Semi-CRF models with noticeably faster inference. The rest of this paper is organized as follows. In the next section, we provide some background and context for understanding the foundational CRF and Semi-CRF models. Next, we present our filtered Semi-CRF model in detail, followed by the experimental setup, the results and further experimental analysis, and an overview of related works. The final section concludes this paper.

2. BACKGROUND

In this section, we first present the Linear-Chain CRF (Lafferty et al., 2001) and then the Semi-Markov CRF (Sarawagi and Cohen, 2005), namely their structured representation and their learning and inference algorithms.

2.1. LINEAR CHAIN CRF

The Linear-Chain CRF (Lafferty et al., 2001) is a sequence labeling model that assigns a label to each token in the input sequence. It assumes dependencies between adjacent output labels (typically a Markov dependency of order 1). Hence, given an input sequence $x$, a sequence of labels $y$ of the same size $L$ is produced, with $y_i \in Y$. The conditional probability of $y$ given $x$ is computed as follows:

$$p(y|x) = \frac{\exp\left(\sum_{i=1}^{L} \psi(y_i|x) + \sum_{i=2}^{L} T_{y_{i-1}, y_i}\right)}{Z(x)} = \frac{\exp \Psi(y|x)}{Z(x)} \qquad (1)$$

where $\psi(y_i|x) \in \mathbb{R}$ is the score of label $y_i$ at position $i$ and $T \in \mathbb{R}^{|Y| \times |Y|}$ is a learnable label transition matrix defined for each pair of labels. Furthermore, $Z(x) = \sum_{y' \in \mathcal{Y}(x)} \exp \Psi(y'|x)$ is the partition function that serves as a normalizer of the probability distribution, where $\mathcal{Y}(x)$ is the set of all label sequences admissible for $x$.

During training, the goal is to update all model parameters by minimizing the negative log-probability of the gold labels: $-\log p(y^*|x) = -\Psi(y^*|x) + \log Z(x)$. The partition function $Z(x)$ is computed in polynomial time using the Forward algorithm (see Eq. 13 in Appendix A.2 for details). For inference, the goal is to produce the optimal labeling $y^* = \arg\max_y \Psi(y|x)$, which is computed using the Viterbi algorithm (Eq. 14 in Appendix A.2). The CRF has linear complexity in the sequence length $L$ and quadratic complexity in the number of labels $|Y|$ for both learning and inference, i.e., $O(L|Y|^2)$.
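The Forward and Viterbi recursions referred to above can be illustrated with a minimal pure-Python sketch. It is not the implementation used in this work: the function names, the list-of-lists layout for the token scores `psi[i][y]` and the transition matrix `T[a][b]`, and the brute-force checker are all assumptions made for illustration.

```python
import math
from itertools import product


def logsumexp(values):
    """Numerically stable log(sum(exp(v))) over a non-empty list."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def sequence_score(psi, T, labels):
    """Psi(y|x): token scores plus transition scores between adjacent labels."""
    emit = sum(psi[i][y] for i, y in enumerate(labels))
    trans = sum(T[a][b] for a, b in zip(labels, labels[1:]))
    return emit + trans


def crf_log_partition(psi, T):
    """Forward algorithm in log space, O(L |Y|^2)."""
    n = len(T)
    alpha = list(psi[0])  # alpha[y]: log-sum of prefix scores ending in label y
    for i in range(1, len(psi)):
        alpha = [psi[i][y] + logsumexp([alpha[a] + T[a][y] for a in range(n)])
                 for y in range(n)]
    return logsumexp(alpha)


def crf_viterbi(psi, T):
    """Viterbi decoding: returns (best score, best label sequence)."""
    n = len(T)
    delta, back = list(psi[0]), []
    for i in range(1, len(psi)):
        step, new_delta = [], []
        for y in range(n):
            prev = max(range(n), key=lambda a: delta[a] + T[a][y])
            step.append(prev)
            new_delta.append(delta[prev] + T[prev][y] + psi[i][y])
        back.append(step)
        delta = new_delta
    y = max(range(n), key=lambda a: delta[a])
    best, path = delta[y], [y]
    for step in reversed(back):  # follow the back-pointers to the first token
        y = step[y]
        path.append(y)
    return best, path[::-1]


def brute_force_log_partition(psi, T):
    """Enumerates all |Y|^L label sequences; only feasible for tiny inputs."""
    n = len(T)
    return logsumexp([sequence_score(psi, T, ys)
                      for ys in product(range(n), repeat=len(psi))])
```

On small inputs, `crf_log_partition` can be compared against `brute_force_log_partition` as a sanity check; the two should agree up to floating-point error.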

2.2. SEMI-MARKOV CRF

Unlike the Linear-Chain CRF, the Semi-CRF (Sarawagi and Cohen, 2005) operates at the segment level to account for segment features that cannot be easily modeled using sequence labeling. The Semi-CRF produces a segmentation $y$ (of size $M$) of the input sequence $x$ (of size $L$, with $L \geq M$). The conditional probability of the labeled segmentation $y$ given an input $x$ is computed as follows:

$$p(y|x) = \frac{\exp\left(\sum_{k=1}^{M} \phi(s_k|x) + T[l_{k-1}, l_k]\right)}{Z(x)} = \frac{\exp \Phi(y|x)}{Z(x)} \qquad (2)$$

where $\phi(s_k|x) \in \mathbb{R}$ is the score of the $k$-th segment of $y$ and $T[l_{k-1}, l_k]$ is the label transition score, with $T[l_0, l_1] = 0$. Furthermore, following Sarawagi and Cohen (2005), a labeled segmentation $y = \{s_1, \ldots, s_M\} \in \mathcal{Y}(x)$ has the following properties:

• A segment $s_k = (i_k, j_k, l_k) \in y$ consists of a start position $i_k$, an end position $j_k$, and a label $l_k \in Y$.
• The segments have positive lengths and completely cover the sequence $1 \ldots L$ without overlapping, i.e., $j_k$ and $i_k$ always satisfy $i_1 = 1$, $j_M = L$, $1 \leq i_k \leq j_k \leq L$, and $i_{k+1} = j_k + 1$.

For instance, for Named Entity Recognition, a segmentation of the sentence "Michael Jordan eats an apple ." would be y = [(1, 2, PER), (3, 3, O), (4, 4, O), (5, 5, O), (6, 6, O)]. In Sarawagi and Cohen (2005), it is always assumed that non-entity segments (also called O or null segments) have unit length.

The model parameters are learned to maximize the conditional probability $p(y|x)$ of the gold segmentation over the training data, similar to CRF. The partition function $Z(x) = \sum_{y' \in \mathcal{Y}(x)} \exp \Phi(y'|x)$ can be computed in polynomial time using a modification of the Forward algorithm (Eq. 15 in Appendix A.3), and inference is done by segmental Viterbi (Eq. 16 in Appendix A.3) to produce the best segmentation $y^* = \arg\max_y \Phi(y|x)$. Finally, the Semi-CRF has quadratic complexity in both the sequence length and the number of labels for both learning and inference, i.e., $O(L^2|Y|^2)$.
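To make the segmental Viterbi recursion concrete, here is a minimal sketch under stated assumptions: the function name `segmental_viterbi`, the callable `phi(i, j, l)`, and the dictionary `T[(l_prev, l)]` are hypothetical interfaces chosen for illustration, and the toy scores in the usage example are invented, not taken from any trained model.

```python
import math


def segmental_viterbi(L, labels, phi, T):
    """Best labeled segmentation of positions 1..L, in O(L^2 |Y|^2).

    phi(i, j, l): score of segment (i, j, l); T[(l_prev, l)]: transition score.
    """
    # best[j][l]: best score of a segmentation of 1..j whose last label is l
    best = [{l: -math.inf for l in labels} for _ in range(L + 1)]
    back = [{l: None for l in labels} for _ in range(L + 1)]
    for j in range(1, L + 1):
        for i in range(1, j + 1):
            for l in labels:
                seg = phi(i, j, l)
                if i == 1:
                    cand, prev = seg, None  # first segment: T[l_0, l_1] = 0
                else:
                    p = max(labels, key=lambda a: best[i - 1][a] + T[(a, l)])
                    cand, prev = best[i - 1][p] + T[(p, l)] + seg, p
                if cand > best[j][l]:
                    best[j][l] = cand
                    back[j][l] = (i, prev)
    l = max(labels, key=lambda a: best[L][a])
    score, j, segments = best[L][l], L, []
    while j > 0:  # backtrack from the last segment to the first
        i, prev = back[j][l]
        segments.append((i, j, l))
        j, l = i - 1, prev
    return score, segments[::-1]


def toy_phi(i, j, l):
    """Invented scores for "Michael Jordan eats an apple ." (L = 6)."""
    if l == "O":
        return 0.5 if i == j else -math.inf  # null segments have unit length
    return 3.0 if (i, j) == (1, 2) else -2.0


toy_labels = ["PER", "O"]
toy_T = {(a, b): 0.0 for a in toy_labels for b in toy_labels}
toy_score, toy_segments = segmental_viterbi(6, toy_labels, toy_phi, toy_T)
```

With these toy scores, the decoded segmentation is the one given in the example above: one PER segment covering positions 1 to 2, followed by four unit-length O segments.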

2.3. GRAPH-BASED FORMULATION OF SEMI-CRF

Given a sequence $x$ of length $|x| = L$, a labeled segment $s_k = (i_k, j_k, l_k)$ is defined by its start and end positions $1 \leq i_k \leq j_k \leq L$ and its label $l_k \in Y$. Let $G(V, E)$ be a directed graph whose set of nodes $V$ is made of all labeled segments of $x$:

$$V = \bigcup_{i=1}^{L} \bigcup_{j=i}^{L} \bigcup_{l=1}^{|Y|} \{(i, j, l)\} \qquad (3)$$

and in which the directed edge $s_{k'} \to s_k \in E$ if and only if $j_{k'} + 1 = i_k$. We further define the weight of an edge $s_{k'} \to s_k$ as follows:

$$w(s_{k'} \to s_k|x) = \phi(s_k|x) + T[l_{k'}, l_k] \qquad (4)$$

where $\phi(s_k|x)$ is the score of the segment $s_k$ and $T[l_{k'}, l_k]$ is the label transition score.

Proposition 1. Any directed path $\{s_1, s_2, \ldots, s_M\}$ in the graph verifying $i_1 = 1$ and $j_M = L$ corresponds to a segmentation of $x$.

Proof. Any such directed path $\{s_1, s_2, \ldots, s_M\}$ satisfies the properties of a segmentation described in Section 2.2, namely $i_1 = 1$, $j_M = L$, $1 \leq i_k \leq j_k \leq L$, and $j_k + 1 = i_{k+1}$ (by definition of the edges).

In addition, the score of the path $\{s_1, s_2, \ldots, s_M\}$, computed as the sum of the edge weights, is equal to the Semi-CRF score (Eq. 2) of the segmentation $y = \{s_1, s_2, \ldots, s_M\}$:

$$\mathrm{score}(s_1, s_2, \ldots, s_M) = \sum_{k=1}^{M} w(s_{k-1} \to s_k|x) = \sum_{k=1}^{M} \phi(s_k|x) + T[l_{k-1}, l_k] = \Phi(y = \{s_1, \ldots, s_M\}|x) \qquad (5)$$

with the convention $w(s_0 \to s_1|x) = \phi(s_1|x)$, since $T[l_0, l_1] = 0$. The search for the best segmentation consists in finding the maximum-weight path of the graph that begins at $i_1 = 1$ and ends at $j_M = L$. Finding the best path in this graph has a complexity of $O(L^3)$ using a generic search algorithm such as Bellman-Ford (see Section 3.3 for details). Nevertheless, taking into account the lattice structure of the problem reduces the complexity to $O(L^2)$, as is done in the Viterbi algorithm (Viterbi, 1967).

3. FILTERED SEMI-MARKOV CRF

In this section, we describe our proposed alternative to Semi-CRF, which we term Filtered Semi-CRF. The motivation for this new model is to address two weaknesses of the Semi-CRF. First, the Semi-CRF is not well-suited for long texts because of its quadratic complexity and its prohibitively large search space. Second, in tasks such as NER where some segments should be labeled null, multiple paths in the Semi-CRF graph can produce the same set of entities. This is because long null segments can be broken into smaller contiguous null segments without modifying the result. In fact, Sarawagi and Cohen (2005) constrain null segments to have unit length and assign them a score. The crux of our approach is to use an independent model to filter the Semi-CRF graph described in § 2.3 prior to further computations. The resulting graph is orders of magnitude smaller than the original one and does not contain null segments, thus addressing both issues.

3.1. FILTERING

In our model, filtering is applied to the full set of segments, which we denote $V_{full}$. The filtering eliminates the segments that are predicted to be null by a local classifier $\phi_{local}$:

$$V = \left\{ s_k \in V_{full} \;\middle|\; \arg\max_{l_k} \phi_{local}(s_k = (i_k, j_k, l_k)|x) \neq \text{null} \right\} \qquad (6)$$

Since the filtered node set $V$ may not contain all segments, the edge definition of Section 2.3 is not applicable here. Thus, we propose to define the edges using the method of Liang et al.: for all $(s_{k'}, s_k) \in V^2$, $s_{k'} \to s_k \in E$ if $j_{k'} < i_k$ and there is no $s_{k^*} \in V$ such that $j_{k'} < i_{k^*}$ and $j_{k^*} < i_k$. This formulation means that $s_{k'} \to s_k$ is an edge if $s_k$ begins after $s_{k'}$ ends and no other segment lies completely inside $(j_{k'}, i_k)$. It generalizes the Semi-CRF to graphs with missing segments. However, when segments are missing, a segmentation does not necessarily start at $i_1 = 1$ and end at $j_M = L$. To fix this problem, we simply add two terminal nodes, start and end:

• start $\to s_k \in E$ if $s_{k'} \to s_k \notin E$ for all $k' \neq$ start
• $s_k \to$ end $\in E$ if $s_k \to s_{k'} \notin E$ for all $k' \neq$ end

A segmentation in the graph is a path from start to end, i.e., $\{s_0, s_1, \ldots, s_M, s_{M+1}\}$ with $s_0 =$ start and $s_{M+1} =$ end (see the footnote on the Maximal Independent Set formulation). An illustration of the graph construction is shown in Figure 1.

For Named Entity Recognition, taking again the example of Section 2.2, the correct segmentation of "Michael Jordan eats an apple ." using the Filtered Semi-CRF would be y = [start, (1, 2, PER), end]. The Filtered Semi-CRF only accounts for entity segments and assumes that the remaining parts of the sequence have the null label.
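The edge rule and the two terminal nodes can be sketched as follows. This is a naive cubic-time illustration written for readability, not the construction used in our implementation; the function name `build_filtered_graph` and the adjacency-dictionary output format are assumptions.

```python
def build_filtered_graph(segments):
    """Builds the path graph over the segments kept by the filter.

    segments: iterable of (i, j, label) tuples with non-null labels.
    Returns a dict mapping each node (and "start") to its list of successors.
    """
    nodes = sorted(set(segments))
    edges = {"start": []}
    edges.update({s: [] for s in nodes})
    has_predecessor = set()
    for a in nodes:
        for b in nodes:
            if a[1] < b[0]:  # b begins after a ends
                # a -> b is blocked if a kept segment fits entirely in between
                blocked = any(a[1] < c[0] and c[1] < b[0] for c in nodes)
                if not blocked:
                    edges[a].append(b)
                    has_predecessor.add(b)
    for s in nodes:
        if s not in has_predecessor:
            edges["start"].append(s)  # s can open a segmentation
        if not edges[s]:
            edges[s].append("end")  # s can close a segmentation
    if not nodes:
        edges["start"].append("end")  # everything was filtered out
    return edges
```

For the four spans (1, 2), (1, 3), (3, 4) and (4, 5), this yields three start-to-end paths: (1, 2) then (3, 4), (1, 2) then (4, 5), and (1, 3) then (4, 5).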

3.2. SCORING, LEARNING AND INFERENCE

Segmentation probability To compute a segmentation score in the filtered graph, we sum the weights of the path edges representing the segmentation, as for the Semi-CRF described in Section 2.3:

$$\mathrm{score}(y = \{s_0, \ldots, s_{M+1}\}|x) = \sum_{k=1}^{M+1} w(s_{k-1} \to s_k|x) \qquad (7)$$

where $w(s_{k-1} \to s_k|x) = \phi_{global}(s_k|x) + T[l_{k-1}, l_k]$ if $k \notin \{1, M+1\}$, $w(s_0 \to s_1|x) = \phi_{global}(s_1|x)$, and $w(s_M \to s_{M+1}|x) = 0$, with $s_0 =$ start and $s_{M+1} =$ end. Note that the start and end nodes are added only to make the problem a single-source, single-destination shortest path problem. Like $\phi_{local}$, $\phi_{global}$ is a neural network: it takes as input the labeled segments $s_k = (i_k, j_k, l_k) \in V$ and returns their scores. Finally, the segmentation probability of the Filtered Semi-CRF is:

$$p(y = \{s_0, \ldots, s_{M+1}\}|x) = \frac{\exp \mathrm{score}(y|x)}{Z(x)} \qquad (8)$$

where $Z(x) = \sum_{y' \in \mathcal{Y}(x)} \exp \mathrm{score}(y'|x)$ is the partition function, which makes the probabilities of all segmentations sum to one. The set $\mathcal{Y}(x)$ contains all paths in the graph from start to end. For reasonably small graphs, $\mathcal{Y}(x)$ can be enumerated, but this is intractable for larger graphs. The partition function can be computed efficiently without enumeration by dynamic programming, using a variant of the Bellman-Ford algorithm, which can be seen as a message-passing algorithm (Wainwright and Jordan, 2008):

Algorithm 1 Computing Z(x)
1: Topologically sort the nodes of V
2: α[start] = 1 and α[k] = 0 otherwise for k ∈ V
3: for all k ≠ start in V do
4:     for all k′ such that k′ → k ∈ E do
5:         α[k] ← α[k] + α[k′] exp(w(s_k′ → s_k | x))
6:     end for
7: end for
8: Z(x) = α[end]

In practice, this implementation of $Z(x)$ is numerically unstable, so we perform all computations in log space to prevent overflow and underflow. The complexity of the algorithm is $O(|V| + |E|)$. We provide more details about the size of $V$ and $E$ as a function of $L$ in Section 3.3.

Learning During training, we jointly minimize the filtering loss and the segmentation loss. The filtering loss $\mathcal{L}_{local}$ of the local classifier $\phi_{local}$ is the sum of the negative log-probabilities of all gold labeled segments of the training set $\mathcal{T}$. Since the filtering task is highly imbalanced, we down-weight the loss for the label $l =$ null as a means of regularization. The weighting ratio $\beta \in [0, 1]$ is tuned on the development set:

$$\mathcal{L}_{local} = -\sum_{\substack{(i,j,l) \in \mathcal{T} \\ l \neq \text{null}}} \log p(i, j, l|x) \; - \; \beta \sum_{\substack{(i,j,l) \in \mathcal{T} \\ l = \text{null}}} \log p(i, j, l|x) \qquad (9)$$

where $p(i, j, l|x)$ is the probability that the segment $(i, j)$ has the label $l$ according to the local classifier $\phi_{local}$. The loss of the segmentation model $\phi_{global}$ is computed as:

$$\mathcal{L}_{global} = -\mathrm{score}(y|x) + \log Z(x) \qquad (10)$$

Furthermore, during training, we constrain the candidate segments $V$ (Eq. 6) to contain the gold entity segments $y$, and we also ensure that the gold segmentation is a path of the filtered graph, i.e., every other candidate span must overlap with at least one gold segment. This choice may be sub-optimal since it can cause exposure bias, i.e., a training-inference discrepancy. However, we found that it works well in practice. Removing this constraint leads to unstable learning and to negative values of the global loss: when the gold path is not in the graph, $\mathrm{score}(y|x)$ is not among the terms of $Z(x)$ and can be larger than $\log Z(x)$. The total loss of the model is the sum of the local and global losses, $\mathcal{L}_{total} = \mathcal{L}_{global} + \mathcal{L}_{local}$.

Inference During inference, the objective is to return the path (from start to end) of the graph that has the best score. We solve this problem using a max-sum dynamic programming algorithm that has the same structure as Algorithm 1:

Algorithm 2 Decoding
1: Topologically sort the nodes of V
2: δ[start] = 0
3: for all k ≠ start in V do
4:     δ[k] = max over k′ with (k′ → k) ∈ E of δ[k′] + w(s_k′ → s_k | x)
5: end for
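Algorithms 1 and 2 share the same traversal, so both can be sketched in a single pass over a topological ordering. The code below is a minimal log-space sketch, not the batched implementation used in our experiments. It assumes the graph is given as a successor dictionary with terminal nodes named "start" and "end", that `weight(u, v)` is a user-supplied callable returning the edge weight, and that Python 3.9 or later is available for `graphlib`. The function name `partition_and_decode` is hypothetical.

```python
import math
from graphlib import TopologicalSorter


def _logaddexp(a, b):
    """log(exp(a) + exp(b)) without overflow; handles -inf operands."""
    if a == -math.inf:
        return b
    if b == -math.inf:
        return a
    m = max(a, b)
    return m + math.log(math.exp(a - m) + math.exp(b - m))


def partition_and_decode(edges, weight):
    """Returns (log Z(x), best path score, best path) on a DAG.

    edges: dict node -> list of successors, containing "start" and reaching "end".
    weight(u, v): weight of the edge u -> v.
    """
    preds = {"start": [], "end": []}
    for u, succs in edges.items():
        preds.setdefault(u, [])
        for v in succs:
            preds.setdefault(v, []).append(u)
    order = list(TopologicalSorter(preds).static_order())

    log_alpha = {k: -math.inf for k in order}  # log of alpha in Algorithm 1
    delta = {k: -math.inf for k in order}      # delta in Algorithm 2
    back = {}
    log_alpha["start"], delta["start"] = 0.0, 0.0  # log(1) = 0
    for k in order:
        for p in preds[k]:
            w = weight(p, k)
            log_alpha[k] = _logaddexp(log_alpha[k], log_alpha[p] + w)  # sum
            if delta[p] + w > delta[k]:                                # max
                delta[k], back[k] = delta[p] + w, p

    path, node = ["end"], "end"
    while node != "start":  # backtrack from end to start
        node = back[node]
        path.append(node)
    return log_alpha["end"], delta["end"], path[::-1]
```

Each edge is visited once, which matches the $O(|V| + |E|)$ cost stated above (the topological sort has the same cost).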

3.3. COMPLEXITY ANALYSIS

In this section, we analyze the complexity $O(|V| + |E|)$ of Algorithms 1 and 2 as a function of the input sequence length $L$. Note that the size of $V$ does not depend on the number of labels $|Y|$, since there is at most one label per segment due to the filtering step in Equation 6.

Proposition 2. There are $\frac{L(L+1)}{2}$ nodes in a complete segment path graph constructed from a sequence of length $L$.

Proposition 3. There are $\frac{L(L-1)(L+1)}{6}$ edges in a complete segment path graph constructed from a sequence of length $L$.

We use Propositions 2 and 3 to derive the complexity of the Filtered Semi-CRF model below. Their proofs can be found in Appendix A.1.
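Both counts can be checked numerically by brute-force enumeration on short sequences. The sketch below is only a sanity check, not a proof; the helper name `complete_graph_size` is hypothetical, and labels are ignored because each span carries at most one label after filtering.

```python
def complete_graph_size(L):
    """Counts nodes and edges of the complete segment path graph by enumeration."""
    # Nodes: all spans (i, j) with 1 <= i <= j <= L.
    nodes = [(i, j) for i in range(1, L + 1) for j in range(i, L + 1)]
    # Edges: (i, j) -> (i2, j2) whenever the second span starts right after the first.
    n_edges = sum(1 for (_, j) in nodes for (i2, _) in nodes if j + 1 == i2)
    return len(nodes), n_edges


for length in range(1, 9):
    n_nodes, n_edges = complete_graph_size(length)
    assert n_nodes == length * (length + 1) // 2
    assert n_edges == length * (length - 1) * (length + 1) // 6
```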

Worst case complexity

In the worst case, the filtering model $\phi_{local}$ does not filter any segments, i.e., all segments are kept. From Propositions 2 and 3, we can deduce that in the worst case $|V| = O(L^2)$ and $|E| = O(L^3)$, which means that the worst-case complexity of our algorithm is cubic in the sequence length $L$, since $O(|V| + |E|) = O(L^3)$. However, note that in the worst case the resulting graph is the Semi-CRF graph, and the complexity can be reduced to $O(L^2)$ using the Forward algorithm (during training) and the Viterbi algorithm (during inference).

Best case complexity

The best-case scenario means that the filtering is perfect, so the number of nodes in the graph $|V|$ is equal to the true number of non-null segments in the input sequence, which we denote by $J$. Moreover, since $J$ does not contain overlapping segments, $|J| \leq L$, with $|J| = L$ if all segments in $J$ have unit length and cover the entire sequence, i.e., $J = \{(i, i, l_i) \mid i = 1 \ldots L, \; l_i \neq \text{null}\}$. Furthermore, $|E| = |J| - 1 \leq L - 1$ because perfect filtering implies that the path is unique. Finally, we can conclude that the complexity is linear, i.e., $O(|V| + |E|) = O(L)$.

Empirical analysis

We further investigate the empirical complexity of our approach by looking for a relationship between $|V| + |E|$ and the sequence length $L$ in practice. We performed the experiments on three Named Entity Recognition datasets: CoNLL-2003, OntoNotes 5.0 and Arabic ACE. The results are shown in Figure 2. The plots show that the graph size $|V| + |E|$ is generally smaller than the sequence length $L$ for a trained model, meaning that empirically the complexity is close to the best-case complexity, which is $O(L)$. However, during training, especially in the first stage, the graph can be large because the filtering model may be poor, as illustrated in Figure 3. Empirically, the early steps of training can be time-consuming because of the larger graph size. However, after a few gradient steps, the size of the graph decreases significantly since most of the segments of an input sequence are labeled as null.

4.1. REPRESENTATION AND SCORES

For all our models, we used pre-trained transformer models (Devlin et al., 2019) to compute word representations. Specifically, the input sequence $\{x_i\}_{i=1}^{n}$ is fed into a pre-trained transformer producing a set of contextualized embeddings $\{h_i\}_{i=1}^{n}$ with $h_i \in \mathbb{R}^D$, where $D$ is the embedding size of the model. In addition, since pre-trained transformers typically split words into sub-tokens, we use the first sub-token embedding as the representation of the whole word, which is a common practice for token-level prediction tasks.

Token scores In our sequence labeling baseline, we compute the label score at position $i$ as a linear projection of the token representation at the same position: $\psi(y_i|x) = w_{y_i}^{\top} h_i \in \mathbb{R}$, where $w_y \in \mathbb{R}^{D \times 1}$ is a label-specific learnable weight vector.

Segment scores For our segment-level models (Semi-CRF and FSemiCRF), we compute the representation $s_{i:j}$ of the segment $(i, j)$ using a sum pooling of the representations of the tokens comprising the segment, $s_{i:j} = \mathrm{SUM}([h_i, h_{i+1}, \ldots, h_j])$. Indeed, according to Adi et al. (2017), sum pooling can effectively model the length of the sequence. Moreover, for the segment-based models (i.e., Semi-CRF and Filtered Semi-CRF), we restrict segments to a maximum width $K$ to reduce complexity, chosen so that the recall on the training set is not harmed (some segments may still be missed on the test set). By bounding the maximum width of the segments, we reduce the number of segments from $L^2$ to $LK$. Thus, under this setup, the complexity of the Semi-Markov CRF becomes $O(LK|Y|^2)$. Finally, the segment scores (i.e., all $\phi_{\cdot}(s_k)$) are computed using a linear projection of the segment representations, analogous to token scores.
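The span enumeration with a maximum width $K$ and sum pooling can be sketched as follows. This is a plain-Python illustration on lists of floats, not the tensorized code used with the transformer encoder. The helper name `span_sum_representations` and the use of prefix sums (so that each span representation is obtained as a difference of two cumulative sums) are assumptions made for illustration.

```python
def span_sum_representations(h, K):
    """Sum-pooled representations of all spans of width at most K.

    h: list of L token vectors (lists of floats of equal size D).
    Returns {(i, j): vector} for 1-indexed spans with j - i + 1 <= K.
    """
    L, D = len(h), len(h[0])
    # prefix[t] holds h_1 + ... + h_t, with prefix[0] the zero vector.
    prefix = [[0.0] * D]
    for vec in h:
        prefix.append([p + v for p, v in zip(prefix[-1], vec)])
    spans = {}
    for i in range(1, L + 1):
        for j in range(i, min(i + K - 1, L) + 1):
            # s_{i:j} = h_i + ... + h_j = prefix[j] - prefix[i - 1]
            spans[(i, j)] = [a - b for a, b in zip(prefix[j], prefix[i - 1])]
    return spans
```

The number of enumerated spans is at most $LK$, which is where the $O(LK|Y|^2)$ bound above comes from.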

4.2. SETUP

Datasets and evaluation We evaluate our models on three diverse Named Entity Recognition datasets. CoNLL-2003 (Tjong Kim Sang and De Meulder, 2003) is a dataset from the news domain designed for extracting entities such as Person, Location, and Organisation. OntoNotes 5.0 (Weischedel et al., 2013) is a large corpus comprising various kinds of text, including newswire, broadcast news, and telephone conversation, with a total of 18 different entity types, such as Person, Organization, Location, Product, or Date. Arabic ACE is the Arabic portion of the multilingual information extraction corpus ACE 2005 (Walker et al., 2006). It includes texts from a wide range of genres, such as newswire, broadcast news, and weblogs, with a total of 7 entity types. We follow the standard approach for evaluating NER models, based on exact matching between predicted and gold entities, discarding non-entity segments. We report the micro-averaged precision (P), recall (R), and F1-score (F) on the test set for models selected on the dev set.

Table 1 (only the following rows and notes are recoverable; values are P, R, F): Boundary Smoothing (Zhu and Li, 2022): 93.61 93.68 93.65 | 91.75 91.74 91.74. PIQN (Shen et al., 2022): 93.29 92.46 92.87 | 91.43 90.73 … Notes: … (Yan et al., 2021) that uses bart-large. † See ablation study (Sec. 5.2) for details about these models.

Hyperparameters To produce contextual token representations, we used bert-large-cased (Devlin et al., 2019) for both the CoNLL-2003 and OntoNotes 5.0 datasets, and bert-base-arabertv2 (Antoun et al., 2020) for Arabic ACE. For simplicity, we do not use auxiliary embeddings (e.g., character embeddings). All models are trained with the Adam optimizer (Kingma and Ba, 2017). We employed a learning rate of 2e-5 for the pre-trained parameters and a learning rate of 5e-4 for the other parameters. We used a batch size of 8 and trained for a maximum of 15 epochs. We keep the best model on the validation set for testing. We trained all the models on a server equipped with V100 GPUs. We implemented our model with PyTorch (Paszke et al., 2019). The pre-trained transformer models were loaded from HuggingFace's Transformers library (Wolf et al., 2019). We used AllenNLP (Gardner et al., 2018) for data preprocessing and the seqeval library (Nakayama, 2018) for evaluating the sequence labeling models. Our Semi-CRF implementation is based on pytorch-struct (Rush, 2020).

Baselines We compare our Filtered Semi-CRF against the CRF (Lafferty et al., 2001) and the Semi-CRF (Sarawagi and Cohen, 2005). We also report some results from the literature: Bi-affineNER (Yu et al., 2020), Bart-NER (Yan et al., 2021), Boundary Smoothing (Zhu and Li, 2022) and PIQN (Shen et al., 2022). For English datasets, all the models use bert-large-cased for token representation, except Bart-NER, which uses bart-large (Lewis et al., 2020). Moreover, for a fair comparison, we only report results for models using sentence-level context (in contrast to paragraph-level context).
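For segment-level outputs, the exact-match metric described under "Datasets and evaluation" reduces to set intersection between predicted and gold entity segments. The sketch below is a stand-in written for illustration and is not the seqeval implementation. The function name `micro_prf` and the input format (one set of `(i, j, label)` tuples per sentence, null segments already discarded) are assumptions.

```python
def micro_prf(gold, pred):
    """Micro-averaged exact-match precision, recall and F1.

    gold, pred: lists with one set of (i, j, label) entity segments per sentence.
    A prediction counts as correct only if boundaries and label both match.
    """
    tp = sum(len(g & p) for g, p in zip(gold, pred))
    n_pred = sum(len(p) for p in pred)
    n_gold = sum(len(g) for g in gold)
    precision = tp / n_pred if n_pred else 0.0
    recall = tp / n_gold if n_gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if tp else 0.0
    return precision, recall, f1
```

A predicted segment with the right label but a boundary off by one token counts as both a false positive and a false negative under this metric.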

5.1. MAIN RESULTS

CRF vs. Semi-CRF vs. FSemiCRF We compare our proposed model to the CRF and Semi-CRF baseline models reported in Table 1. Semi-CRF is the worst-performing model, with the lowest scores on the CoNLL-2003 and OntoNotes 5.0 datasets and the same performance as CRF on the Arabic ACE dataset. Moreover, on all datasets, our proposed FSemiCRF outperforms CRF and Semi-CRF in terms of precision and recall, demonstrating its utility in a variety of scenarios. Furthermore, we find that there is no significant difference between FSemiCRF with and without transition scores (in fact, most of the time the result is the same), which can be explained by the fact that adjacent segments in the filtered graph may be far from each other.

Comparison to SOTA Compared to the state-of-the-art models, our FSemiCRF has the highest score on the CoNLL-2003 dataset, outperforming the second-highest score by 0.24 in terms of F1-score. On OntoNotes, while not the best, our model is still competitive.

Table 2: Model throughput (higher is better). We measure the throughput of the model in batches per second, using a batch size of 8 on a V100 GPU. All models use the same vector dimensions for token representation for a fair comparison.

5.2. ABLATION STUDY

Semi-CRF + Unit size null We study an alternative variant of the Semi-CRF that allows null labels only for segments of unit length. To do this, we simply modify the original Semi-CRF by masking segmentation paths that contain null segments whose size is greater than one. The motivation for this study is to reduce the search space and remove segmentation ambiguity. We can see that it improves the results on CoNLL-2003 and OntoNotes 5.0. However, the results are still poor compared to the other approaches.

FSemiCRF w/o global loss As shown in Table 1, we investigate the influence of the global loss on FSemiCRF by removing it, resulting in a local span-based NER model. Its decoding is performed using a greedy algorithm where the highest-scoring entity is iteratively added to the result as long as it does not overlap with the previously selected entities. Even without the global loss, the model is competitive, but the global model consistently improves the scores.
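The greedy decoding used for the variant without global loss can be sketched in a few lines. This is a minimal illustration under stated assumptions rather than the exact code of the ablation: the function name `greedy_decode` and the input format (a list of `(score, i, j, label)` tuples for the non-null segments) are hypothetical.

```python
def greedy_decode(scored_segments):
    """Greedy selection of non-overlapping entity segments.

    scored_segments: list of (score, i, j, label) for segments predicted non-null.
    Segments are visited from highest to lowest score and kept only if they do
    not overlap with any segment selected so far.
    """
    selected = []
    for score, i, j, label in sorted(scored_segments, reverse=True):
        if all(j < a or i > b for (a, b, _) in selected):
            selected.append((i, j, label))
    return sorted(selected)
```

Unlike the max-sum decoding over the filtered graph, this procedure makes each choice locally, so it is not guaranteed to return the highest-scoring set of non-overlapping segments.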

5.3. EFFICIENCY ANALYSIS

In this section, we analyse the computational efficiency of the models for both training and inference. We performed two experiments: 1) the training and inference throughput, reported in Table 2 and measured in batches per second; 2) the inference wall clock time of the Semi-CRF and FSemiCRF, showing the time needed for computing the span scores and for decoding, in milliseconds per sample. For both experiments, we use a batch size of 8 and an Nvidia V100 GPU with 16 GB of memory. For a fair comparison, for all the datasets and models, we employed a similar model size for the token representation, namely bert-base-cased for CoNLL-2003 and OntoNotes 5.0 and bert-base-arabertv2 for Arabic ACE.

Throughput For training, the results show that the CRF model is the fastest on most of the datasets. The FSemiCRF is the second fastest; it has a better training throughput than the Semi-CRF on all datasets except CoNLL-2003. We empirically found that the speed of the Semi-CRF depends strongly on the number of labels; therefore, it is fast on CoNLL-2003 since this dataset has only a few label types. During inference, our FSemiCRF is significantly faster than the other methods: it is 5 times faster than Semi-CRF on OntoNotes 5.0 and 2 times faster on Arabic ACE. We explain this behavior by two main points: 1) During inference, segment filtering is highly parallelizable, while during training it is not. 2) The complexity of FSemiCRF strongly depends on the performance of the filtering model; at the early stage of training, the filtering model may be poor, which leads to a larger graph (as shown in Figure 3), while the size of the graph is generally small during inference. See Section 3.3 for more details.

Wall clock time We report a wall clock time analysis of the Semi-CRF and Filtered Semi-CRF in Table 3. As shown in the table, computing the segment scores (using BERT-based models) takes the same time for both approaches. However, for the decoding, Semi-CRF applies the segmental Viterbi algorithm to all segments, while FSemiCRF only uses the filtered segments. This study shows that the decoding time of the FSemiCRF is almost negligible compared to computing the segment scores. In contrast, the decoding for Semi-CRF is significantly slower. Notably, for the Semi-CRF, decoding is sometimes slower than computing the segment scores, which is the case on the OntoNotes 5.0 and Arabic ACE datasets.

6. RELATED WORK

Many frameworks have been proposed for text segmentation. The most popular is the Linear-Chain CRF (Lafferty et al., 2001), which treats text segmentation tasks as token-level prediction. It is trained by maximizing the sequence-level objective of the gold standard labeling and uses the Viterbi algorithm (Viterbi, 1967; Forney, 2010) for decoding, adding some constraints to the transition matrix to enforce the well-formedness of the output. The first variants employed handcrafted features (Lafferty et al., 2001; Gross et al., 2006; Roth and Yih, 2005), and the model has since been extended to automatic feature learning using neural networks (Do and Artières, 2010; van der Maaten et al., 2011; Kim et al., 2015; Huang et al., 2015; Lample et al., 2016). Usually, CRF is used with a first-order Markov transition on the labels. Other methods, such as Ye et al. (2009) and Cuong et al. (2014), have proposed to employ higher-order dependencies to further enhance the performance; however, due to their high complexity and marginal gains, they have not gained in popularity.

Semi-CRF (Sarawagi and Cohen, 2005) has been proposed as an alternative to CRF for sequence segmentation tasks. Instead of operating on the token level, the Semi-CRF considers segments as the basic unit for the prediction. It has been applied to several sequence segmentation tasks, such as Chinese word segmentation (Kong et al., 2016) and Named Entity Recognition (Sarawagi and Cohen, 2005; Andrew, 2006; Zhuo et al., 2016; Liu et al., 2016; Ye and Ling, 2018). Its main advantage over traditional CRFs is that it can incorporate segment-level features such as segment length, which can help obtain a model with higher predictive ability. However, it presents two major shortcomings: it has quadratic complexity as a function of the sequence length, which makes it difficult to apply to long sequences, and it generally obtains inferior results or only marginal gains compared to CRFs (Liang, 2005; Daumé and Marcu, 2005; Andrew, 2006). In this work, we proposed a more efficient alternative by adding a filtering step that drops null segments. Our approach provides significantly better performance and more efficient inference than both CRF and Semi-CRF.

7. CONCLUSION

In this paper, we proposed Filtered Semi-CRF, a novel technique for text segmentation tasks. We applied our method to the Named Entity Recognition (NER) task and obtained significant gains over traditional CRF and Semi-CRF models on various benchmark datasets. In addition to these gains, our algorithm is faster at inference and more scalable than the baseline models. In future work, we plan to extend our algorithm to nested segment structures.

A APPENDIX

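A.1 PROOFS

Proposition 2. There are $\frac{L(L+1)}{2}$ nodes in a complete segment path graph constructed from a sequence of length $L$.

Proof. Nodes are the enumeration of all segments (regardless of labels). Thus,

$$|V| = \sum_{i=1}^{L} \sum_{j=i}^{L} 1 = \sum_{i=1}^{L} (L - i + 1) = \frac{L(L+1)}{2}$$

Proposition 3. There are $\frac{L(L-1)(L+1)}{6}$ edges in a complete segment path graph constructed from a sequence of length $L$.

Proof. We know that in the complete segment graph:

1. By definition, $(i_k, j_k) \to (i_{k'}, j_{k'}) \in E$ iff $j_k + 1 = i_{k'}$.
2. There are $j_k$ segments ending at $j_k$, i.e., $\left|\bigcup_{i=1}^{j_k} (i, j_k)\right| = j_k$.
3. There are $L - j_k$ segments starting at $i_{k'}$, i.e., $\left|\bigcup_{i=i_{k'}}^{L} (i_{k'}, i)\right| = L - i_{k'} + 1 = L - j_k$.

From 1, 2 and 3, we can deduce that there are $j_k (L - j_k)$ edges going from a segment ending at $j_k$ to a segment starting at $i_{k'} = j_k + 1$. Finally, the total number of edges of the graph is the sum over all $j_k$ from 1 to $L$:

$$|E| = \sum_{j_k=1}^{L} j_k (L - j_k) = L \sum_{j_k=1}^{L} j_k - \sum_{j_k=1}^{L} j_k^2 = \frac{L^2(L+1)}{2} - \frac{L(L+1)(2L+1)}{6} = \frac{L(L-1)(L+1)}{6}$$

The best labeling is given by the path traced by $\max_{y \in Y} \delta(L, y)$. Both the computation of the partition function and the decoding of the CRF have a complexity of $O(L|Y|^2)$.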



It is worth noting that the segmentation problem can be formulated as finding the highest-scoring Maximal Independent Set (MIS) in an interval graph (Gupta et al., 1982).



y consists of a start position $i_k$, an end position $j_k$, and a label $l_k \in Y$.
• The segments have positive lengths and completely cover the sequence $1 \ldots L$ without overlapping, i.e., $j_k$ and $i_k$ always satisfy

Figure 1: Filtered Semi-Markov CRF. 1) Enumerate all the segments of the input sequence. 2) Drop null segments (Eq. 6) using a local segment classifier $\phi_{\text{local}}$. 3) Construct the path graph from the filtered segments; we omit the transition scores for better readability. 4) During training, compute the loss function (Eq. 9 and 10) by constraining the gold path $y^*$ to be a path of the graph; during inference, return the maximum weighted path (Alg. 2). Note that the size of the graph can vary considerably depending on the input sequence and the training stage (Fig. 2 and 3).

5: end for

The highest-scoring path, i.e., $\arg\max_y \text{score}(y|x)$, is the path traced by $\delta[\text{end}]$, which can be obtained by backtracking. This algorithm has a complexity of $O(|V| + |E|)$, the same as computing the partition function $Z(x)$.
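The decoding step described here can be sketched as a single pass over a directed acyclic graph. This is a minimal sketch under stated assumptions, not the paper's Alg. 2: the function name `max_weight_path`, the dummy `"start"` and `"end"` nodes, and the `node_score` and `edges` containers are hypothetical, and the nodes are assumed to be supplied in topological order (for instance, segments sorted by start position).

```python
def max_weight_path(nodes, node_score, edges):
    """Return (best_score, best_path) for the highest-scoring start-to-end path.

    nodes: nodes in topological order, with nodes[0] == "start" and
           nodes[-1] == "end" (hypothetical dummy nodes).
    node_score: dict mapping node -> score (dummy nodes score 0.0).
    edges: dict mapping node -> list of (successor, transition_score).
    """
    neg_inf = float("-inf")
    delta = {v: neg_inf for v in nodes}
    back = {}
    delta[nodes[0]] = node_score[nodes[0]]
    for u in nodes:  # each node is visited once
        if delta[u] == neg_inf:
            continue  # not reachable from the start node
        for v, t in edges.get(u, []):  # each edge is relaxed once
            cand = delta[u] + t + node_score[v]
            if cand > delta[v]:
                delta[v] = cand
                back[v] = u
    # Backtrack from the end node to recover the best path.
    path = [nodes[-1]]
    while path[-1] in back:
        path.append(back[path[-1]])
    path.reverse()
    return delta[nodes[-1]], path
```

Because every node and every edge is touched exactly once, the running time is proportional to $|V| + |E|$, which is where the stated complexity comes from.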

Figure 2: Empirical complexity analysis. This plot illustrates the relationship between the size of the filtered graph ($|V| + |E|$) and the input sequence length $L$ on three NER datasets. This experiment is done with trained models.

Figure 3: Evolution of the graph size during training. Both axes are in log scale, and the data are smoothed using a Savitzky-Golay filter (Savitzky and Golay, 1964). There are three main stages. At the beginning of training, the graph is large because the filtering model is not yet trained. In the second stage, the graph is small because the filtering model is confident about null segments (most segments are null). In the last stage, the size of the graph stabilizes.

There are $\frac{L(L-1)(L+1)}{6}$ edges in a complete segment path graph constructed from a sequence of length $L$.
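This edge count, together with the node count $L(L+1)/2$ of the same graph, can be checked by brute force for small $L$. The sketch below enumerates the complete segment graph directly from its definition (unlabeled segments $(i, j)$ with $1 \le i \le j \le L$, and an edge between two segments when the second starts right after the first ends). The function names are illustrative and do not come from the paper.

```python
def complete_segment_graph(L):
    """Enumerate nodes and edges of the complete segment path graph for length L.

    Nodes are unlabeled segments (i, j) with 1 <= i <= j <= L; an edge joins
    (i, j) to (i2, j2) exactly when j + 1 == i2 (adjacent segments).
    """
    nodes = [(i, j) for i in range(1, L + 1) for j in range(i, L + 1)]
    edges = [(a, b) for a in nodes for b in nodes if a[1] + 1 == b[0]]
    return nodes, edges


def closed_form_counts(L):
    """Closed-form node and edge counts for a sequence of length L."""
    return L * (L + 1) // 2, L * (L - 1) * (L + 1) // 6


# The enumeration and the closed forms agree for every small length tried.
for L in range(1, 9):
    nodes, edges = complete_segment_graph(L)
    assert (len(nodes), len(edges)) == closed_form_counts(L)
```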

The partition function $Z(x)$ of the CRF (Lafferty et al., 2001) is computed using the forward algorithm, with $\alpha(1, y) = \exp\{\psi(y|x)\}$ and, for $i = 2 \ldots L$:

$$\alpha(i, y) = \sum_{y' \in Y} \alpha(i-1, y') \exp\{\psi(y|x) + T_{y', y}\}$$

The decoding of the CRF is done with the Viterbi algorithm, with $\delta(1, y) = \psi(y|x)$ and, for $i = 2 \ldots L$:

$$\delta(i, y) = \max_{y' \in Y} \delta(i-1, y') + \psi(y|x) + T_{y', y}$$
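The two recursions can be sketched in a few lines of Python. This is an illustrative sketch rather than the authors' implementation: the forward pass is carried out in log space for numerical stability, and the inputs `psi` (a list of per-position label scores, 0-indexed) and `T` (a transition score matrix) are assumed data layouts.

```python
import math


def logsumexp(values):
    """Numerically stable log(sum(exp(v) for v in values))."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def crf_log_partition(psi, T):
    """Forward algorithm in log space; returns log Z(x).

    psi[i][y]: score of label y at position i; T[y_prev][y]: transition score.
    """
    n_labels = len(T)
    log_alpha = list(psi[0])  # first position: emission score only
    for i in range(1, len(psi)):
        log_alpha = [
            psi[i][y] + logsumexp([log_alpha[yp] + T[yp][y] for yp in range(n_labels)])
            for y in range(n_labels)
        ]
    return logsumexp(log_alpha)


def crf_viterbi(psi, T):
    """Viterbi decoding; returns (best_score, best_label_sequence)."""
    n_labels = len(T)
    delta = list(psi[0])
    backptrs = []
    for i in range(1, len(psi)):
        new_delta, ptr = [], []
        for y in range(n_labels):
            best_prev = max(range(n_labels), key=lambda yp: delta[yp] + T[yp][y])
            ptr.append(best_prev)
            new_delta.append(delta[best_prev] + T[best_prev][y] + psi[i][y])
        delta = new_delta
        backptrs.append(ptr)
    y = max(range(n_labels), key=lambda k: delta[k])
    best_score = delta[y]
    labels = [y]
    for ptr in reversed(backptrs):  # follow back-pointers from the last position
        y = ptr[y]
        labels.append(y)
    labels.reverse()
    return best_score, labels
```

Each position loops over all pairs of previous and current labels, which gives the $O(L|Y|^2)$ cost for both functions.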

Unit size null †                              92.08  91.41  91.74    89.17  89.76  89.47    83.35  83.62  83.48
FSemiCRF                                      94.72  93.09  93.89    90.69  91.31  91.00    83.43  85.51  84.46
- w/o $\mathcal{L}_{\text{global}}$ (10) †    94.24  92.70  93.46    90.85  89.57  90.21    83.73  83.56  83.64

Main results. All English models employ bert-large-cased for representing the tokens on English datasets, except

Wall-clock time (lower is better). This table compares the average wall-clock time per sample, in milliseconds, of Semi-CRF and Filtered Semi-CRF. We separate the time needed for computing the segment representations (with BERT) from the time needed by the decoding algorithm. Note that the scoring time is the same for Semi-CRF and FSemiCRF. We use the same setup as in Table 2.

A.3 SEMI-CRF

Partition function. The partition function $Z(x)$ of the Semi-CRF (Sarawagi and Cohen, 2005) is computed using the following dynamic program (a modification of the forward algorithm), with $\alpha(0, :) = 1$, $\alpha(m, :) = 0$ if $m < 0$, and otherwise:

Decoding. The decoding of the Semi-CRF is done with the segmental (Semi-Markov) Viterbi algorithm, with $\delta(0, :) = 0$, $\delta(m, :) = -\infty$ if $m < 0$, and otherwise:

The highest-scoring segmentation is the path traced by $\max_{y \in Y} \delta(L, y)$. Both the computation of the partition function and the decoding of the Semi-CRF have a complexity of $O(L^2 |Y|^2)$.
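As an illustration of the segmental Viterbi recursion, the sketch below decodes the best segmentation of a sequence of length $L$. It is written under assumptions that are not taken from the paper: `seg_score(i, j, y)` is a hypothetical callable scoring the segment that covers tokens $i$ to $j$ (1-indexed, inclusive) with label $y$, `T` is a nested dictionary of transition scores between the labels of adjacent segments, and the first segment receives no transition term. The partition function follows the same loop structure with the maximum replaced by a log-sum-exp.

```python
def semicrf_viterbi(L, labels, seg_score, T):
    """Segmental Viterbi; returns (best_score, segments) with segments as (i, j, y).

    seg_score(i, j, y) and T[y_prev][y] are hypothetical interfaces.
    """
    neg_inf = float("-inf")
    delta = [{y: neg_inf for y in labels} for _ in range(L + 1)]
    back = [{y: None for y in labels} for _ in range(L + 1)]
    for y in labels:
        delta[0][y] = 0.0  # delta(0, :) = 0
    for j in range(1, L + 1):  # end position of the last segment
        for y in labels:
            for i in range(1, j + 1):  # start position of the last segment
                s = seg_score(i, j, y)
                if i == 1:
                    # First segment: no previous label, so no transition (assumption).
                    cands = [(s, None)]
                else:
                    cands = [(delta[i - 1][yp] + T[yp][y] + s, yp) for yp in labels]
                for score, yp in cands:
                    if score > delta[j][y]:
                        delta[j][y] = score
                        back[j][y] = (i, yp)
    y = max(labels, key=lambda k: delta[L][k])
    best = delta[L][y]
    segments, j = [], L
    while j > 0:  # backtrack segment by segment
        i, yp = back[j][y]
        segments.append((i, j, y))
        j, y = i - 1, yp
    segments.reverse()
    return best, segments
```

The loops over the end position, the start position, and the pairs of labels account for the $O(L^2 |Y|^2)$ cost.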

