FILTERED SEMI-MARKOV CRF

Abstract

Semi-Markov CRF (Sarawagi and Cohen, 2005) has been proposed as an alternative to the traditional Linear-Chain CRF (Lafferty et al., 2001) for text segmentation tasks such as Named Entity Recognition. In contrast to CRF, which treats text segmentation as token-level prediction, Semi-CRF considers spans as the task's basic unit, which makes it more expressive. However, Semi-CRF has two major drawbacks: (1) it has quadratic complexity in the sequence length, as it operates on every span of the input sequence, and (2) empirically, it performs worse than the classical CRF on sequence labeling tasks such as NER. In our work, we propose the Filtered Semi-Markov CRF, a Semi-CRF variant that addresses both issues. Our model extends Semi-CRF with a filtering step that eliminates irrelevant segments, which reduces complexity and dramatically prunes the search space. On a variety of NER benchmarks, we find that our approach outperforms both the CRF and Semi-CRF models while being significantly faster. We will make our code available to the public.

1. INTRODUCTION

Sequence segmentation is the process of dividing a sequence into distinct, non-overlapping segments that cover the entire sequence (Sarawagi and Cohen, 2005; Terzi, 2006). It has a wide range of use cases, including Named Entity Recognition (Tjong Kim Sang and De Meulder, 2003) and Chinese Word Segmentation (Li and Yuan, 1998). Sequence segmentation has traditionally been cast as a sequence labeling problem using pre-existing tagging templates such as the BIO and BILOU schemes (Ratinov and Roth, 2009). The Conditional Random Field (CRF) (Lafferty et al., 2001) has been widely used in sequence labeling to model the dependence between adjacent token tags. Although the Linear-Chain CRF performs well on various segmentation tasks, operating at the segment level rather than the token level is a more natural way to perform sequence segmentation. To this end, the Semi-Markov CRF (Sarawagi and Cohen, 2005) has been proposed as a variant of the CRF that allows the incorporation of higher-level segment features, such as segment width. However, the Semi-CRF, unlike the CRF, is considerably slower for both learning and inference due to its quadratic complexity with respect to the sequence length. Moreover, the Semi-CRF generally performs worse than the CRF, and when it does perform better, the gain is only marginal (Liang, 2005; Daumé and Marcu, 2005; Andrew, 2006). Indeed, the Semi-CRF performs joint segmentation and labeling, which results in a much larger search space and makes learning more challenging.

To address this problem, we propose a filtered version of the Semi-CRF. Like the Semi-CRF (Sarawagi and Cohen, 2005), our model operates on segments, but we add a filtering model that discards a large number of candidate segments. Our aim is to reduce the computational complexity by pruning the segmentation search space.
During inference, after the filtering step, finding the best segmentation and labeling boils down to finding the maximum-scoring path in a weighted directed acyclic graph. During training, we use a similar dynamic programming algorithm that sums over all paths in the graph. We evaluate our approach on benchmark datasets for Named Entity Recognition and find that it outperforms both CRF and Semi-CRF models with noticeably faster inference. The rest of this paper is organized as follows. In the next section, we provide background for understanding the foundational CRF and Semi-CRF models. We then present our Filtered Semi-CRF model in detail, followed by the experimental setup, the results and further experimental analysis, and an overview of related work. The final section concludes the paper.
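The graph construction itself belongs to the proposed model (detailed later in the paper), but the generic subroutine it relies on, finding the maximum-scoring path in a weighted DAG, can be sketched as follows. This is an illustrative implementation, not the paper's code; the node and edge semantics here are placeholders.

```python
def max_scoring_path(num_nodes, edges):
    """Maximum-scoring path in a weighted DAG.

    Nodes 0..num_nodes-1 are assumed to be in topological order, and each
    edge (u, v, w) satisfies u < v; the path goes from node 0 to the last node.
    Returns (best_score, path). Runs in O(V + E) after sorting the edges.
    """
    NEG = float("-inf")
    best = [NEG] * num_nodes   # best[v]: max score of any path from 0 to v
    back = [None] * num_nodes  # back[v]: predecessor of v on that path
    best[0] = 0.0
    for u, v, w in sorted(edges):  # sorting by u respects topological order
        if best[u] != NEG and best[u] + w > best[v]:
            best[v], back[v] = best[u] + w, u
    # Recover the path by following back-pointers from the final node.
    path, v = [], num_nodes - 1
    while v is not None:
        path.append(v)
        v = back[v]
    return best[num_nodes - 1], list(reversed(path))
```

Replacing `max` with log-sum-exp in the same recursion yields the sum over all paths used during training.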

2. BACKGROUND

In this section, we first present the Linear-Chain CRF (Lafferty et al., 2001) and then the Semi-Markov CRF (Sarawagi and Cohen, 2005), covering their structured representation and their learning and inference algorithms.

2.1. LINEAR CHAIN CRF

The Linear-Chain CRF (Lafferty et al., 2001) is a sequence labeling model that assigns a label to each token of the input sequence. It assumes dependencies between adjacent output labels (typically a Markov dependency of order 1). Hence, given an input sequence x of length L, a sequence of labels y of the same length is produced, with y_i ∈ Y. The conditional probability of y given x is computed as:

    p(y|x) = exp Ψ(y|x) / Z(x),   with   Ψ(y|x) = Σ_{i=1..L} ( ψ(y_i|x) + T[y_{i-1}, y_i] ),

where ψ(y_i|x) ∈ R is the score of label y_i at position i and T ∈ R^{|Y|×|Y|} is a learnable label transition matrix defined for each pair of labels. Furthermore,

    Z(x) = Σ_{y' ∈ Y(x)} exp Ψ(y'|x)

is the partition function that normalizes the probability distribution, where Y(x) is the set of all label sequences admissible for x. During training, the goal is to update the model parameters by minimizing the negative log-probability of the gold labels, -log p(y|x) = log Z(x) - Ψ(y|x), where Z(x) is computed in polynomial time using the Forward algorithm (see Eq. 13 in Appendix A.2 for details). For inference, the goal is to produce the optimal label sequence y* = argmax_y Ψ(y|x), which is computed using the Viterbi algorithm (Eq. 14 in Appendix A.2). The CRF has linear complexity in the sequence length L and quadratic complexity in the number of labels |Y| for both learning and inference, i.e., O(L|Y|^2).
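As an illustrative sketch (not the paper's implementation), the Forward and Viterbi recursions above can be written as follows, with `psi[i][y]` playing the role of ψ(y_i|x) and `T[y_prev][y]` the transition matrix:

```python
import math

def logsumexp(vals):
    """Numerically stable log(sum(exp(v) for v in vals))."""
    m = max(vals)
    return m + math.log(sum(math.exp(v - m) for v in vals))

def crf_forward(psi, T):
    """Log partition function log Z(x) of a linear-chain CRF."""
    L, Y = len(psi), len(psi[0])
    # alpha[y]: log-sum of scores of all prefixes ending at position i with label y
    alpha = list(psi[0])
    for i in range(1, L):
        alpha = [logsumexp([alpha[yp] + T[yp][y] for yp in range(Y)]) + psi[i][y]
                 for y in range(Y)]
    return logsumexp(alpha)

def crf_viterbi(psi, T):
    """Best label sequence y* = argmax_y Psi(y|x) and its score."""
    L, Y = len(psi), len(psi[0])
    delta, back = list(psi[0]), []
    for i in range(1, L):
        new_delta, ptr = [], []
        for y in range(Y):
            scores = [delta[yp] + T[yp][y] for yp in range(Y)]
            best = max(range(Y), key=lambda yp: scores[yp])
            new_delta.append(scores[best] + psi[i][y])
            ptr.append(best)
        delta, back = new_delta, back + [ptr]
    y = max(range(Y), key=lambda lab: delta[lab])
    path = [y]
    for ptr in reversed(back):  # follow back-pointers from the last position
        y = ptr[y]
        path.append(y)
    return list(reversed(path)), max(delta)
```

Both recursions touch each position once and each label pair per position, which gives the O(L|Y|^2) complexity stated above.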

2.2. SEMI-MARKOV CRF

Unlike the Linear-Chain CRF, the Semi-CRF (Sarawagi and Cohen, 2005) operates at the segment level to account for segment features that cannot be easily modeled with sequence labeling. The Semi-CRF produces a labeled segmentation y = {y_1, ..., y_M} of the input sequence x (of size L, with L ≥ M), where each segment y_k = (i_k, j_k, l_k) has a start position i_k, an end position j_k, and a label l_k ∈ Y. The conditional probability of the labeled segmentation y given an input x is computed as follows:

    p(y|x) = exp Φ(y|x) / Z(x),   with   Φ(y|x) = Σ_{k=1..M} ( φ(y_k|x) + T[l_{k-1}, l_k] ),

where φ(y_k|x) ∈ R is the score of segment y_k. The model parameters are learned to maximize the conditional probability of the gold segmentation p(y|x) over the training data, as with the CRF. The partition function Z(x) = Σ_{y' ∈ Y(x)} exp Φ(y'|x) can be computed in polynomial time using a modification of the Forward algorithm (Eq. 15 in Appendix A.3), and inference is done with a segmental Viterbi algorithm (Eq. 16 in Appendix A.3) to produce the best segmentation y* = argmax_y Φ(y|x). Finally, the Semi-CRF has quadratic complexity in both the sequence length and the number of labels for both learning and inference, i.e., O(L^2 |Y|^2).
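The segmental Viterbi recursion can be sketched as below. This is an illustrative implementation under an assumed interface, with segment scores given as a lookup table `phi[(i, j, l)]` for a segment spanning tokens i..j (inclusive) with label l; the paper's formal recursion is Eq. 16 in Appendix A.3.

```python
def semicrf_viterbi(phi, T, L, Y):
    """Best labeled segmentation y* = argmax_y Phi(y|x).

    phi[(i, j, l)]: score of segment covering tokens i..j with label l;
    T[lp][l]: label transition score. Returns (segments, score), with
    segments as a list of (start, end, label) tuples covering 0..L-1.
    """
    NEG = float("-inf")
    delta = [[NEG] * Y for _ in range(L)]  # delta[j][l]: best score of a
    back = [[None] * Y for _ in range(L)]  # segmentation of x[0..j] ending with label l
    for j in range(L):
        for l in range(Y):
            for i in range(j + 1):  # enumerate the start of the last segment
                seg = phi[(i, j, l)]
                if i == 0:
                    cands = [(seg, None)]
                else:
                    cands = [(delta[i - 1][lp] + T[lp][l] + seg, (i - 1, lp))
                             for lp in range(Y)]
                for s, bp in cands:
                    if s > delta[j][l]:
                        delta[j][l], back[j][l] = s, (i, bp)
    # Backtrack from the best final label to recover the segments.
    l = max(range(Y), key=lambda lab: delta[L - 1][lab])
    j, segments = L - 1, []
    while True:
        i, bp = back[j][l]
        segments.append((i, j, l))
        if bp is None:
            break
        j, l = bp
    return list(reversed(segments)), max(delta[L - 1])
```

The inner loop over every segment start i for every end j is what produces the O(L^2 |Y|^2) complexity; the filtering step proposed in this paper prunes exactly this enumeration of candidate segments.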

