Computer Laboratory

Technical reports

Practical unification-based parsing of natural language

John Andrew Carroll

173 pages

This technical report is based on a dissertation submitted September 1993 by the author for the degree of Doctor of Philosophy to the University of Cambridge.

Abstract

The thesis describes novel techniques and algorithms for the practical parsing of realistic Natural Language (NL) texts with a wide-coverage unification-based grammar of English. The thesis tackles two of the major problems in this area: firstly, the fact that parsing realistic inputs with such grammars can be computationally very expensive, and secondly, the observation that many analyses are often assigned to an input, only one of which usually forms the basis of the correct interpretation.

The thesis starts by presenting a new unification algorithm, justifies why it is well-suited to practical NL parsing, and describes a bottom-up active chart parser which employs this unification algorithm together with several other novel processing and optimisation techniques. Empirical results demonstrate that an implementation of this parser has significantly better practical performance than a comparable, state-of-the-art unification-based parser. Next, techniques for computing an LR table for a large unification grammar are described, a context free non-deterministic LR parsing algorithm is presented which has better time complexity than any previously reported using the same approach, and a unification-based version is derived. In experiments, the performance of an implementation of the latter is shown to exceed both the chart parser and also that of another efficient LR-like algorithm recently proposed.

Building on these methods, a system for parsing text taken from a given corpus is described which uses probabilistic techniques to identify the most plausible syntactic analyses for an input from the often large number licensed by the grammar. New techniques implemented include an incremental approach to semi-supervised training, a context-sensitive method of scoring sub-analyses, the accurate manipulation of probabilities during parsing, and the identification of the highest ranked analyses without exhaustive search. The system attains a similar success rate to approaches based on context-free grammar, but produces analyses which are more suitable for semantic processing.

The thesis includes detailed analyses of the worst-case space and time complexities of all the main algorithms described, and discusses the practical impact of the theoretical complexity results.

Full text

PS (0.3 MB)

BibTeX record

@TechReport{UCAM-CL-TR-314,
  author =	 {Carroll, John Andrew},
  title = 	 {{Practical unification-based parsing of natural language}},
  url = 	 {http://www.cl.cam.ac.uk/techreports/UCAM-CL-TR-314.ps.gz},
  institution =  {University of Cambridge, Computer Laboratory},
  number = 	 {UCAM-CL-TR-314}
}