Department of Computer Science and Technology

Technical reports

Generalised probabilistic LR parsing of natural language (corpora) with unification-based grammars

Ted Briscoe, John Carroll

45 pages

DOI: 10.48456/tr-224

Abstract

We describe work towards the construction of a very wide-coverage probabilistic parsing system for natural language (NL), based on LR parsing techniques. The system is intended to rank the large number of syntactic analyses produced by NL grammars according to the frequency of occurrence of the individual rules deployed in each analysis. We discuss a fully automatic procedure for constructing an LR parse table from a unification-based grammar formalism, and consider the suitability of alternative LALR(1) parse table construction methods for large grammars. The parse table is used as the basis for two parsers: a user-driven interactive system, which provides a computationally tractable and labour-efficient method of supervised learning of the statistical information required, and the probabilistic parser that this information drives. The latter is constructed by associating probabilities with the LR parse table directly. This technique is superior to parsers based on probabilistic lexical tagging or probabilistic context-free grammar because it allows for a more context-dependent probabilistic language model, as well as use of a more linguistically adequate grammar formalism. We compare the performance of an optimised variant of Tomita’s (1987) generalised LR parsing algorithm to an (efficiently indexed and optimised) chart parser. We report promising results of a pilot study training on 151 noun definitions from the Longman Dictionary of Contemporary English (LDOCE) and retesting on these plus a further 54 definitions. Finally, we discuss limitations of the current system and possible extensions to deal with lexical (syntactic and semantic) frequency of occurrence.
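As a rough illustration of what associating probabilities with an LR parse table could look like, the sketch below estimates the probability of each action given the state in which it was taken, using relative frequencies over supervised derivations, and then ranks competing analyses by the product of their action probabilities. This is a simplified assumption for exposition only: the state numbers, action labels, the `floor` smoothing value, and all function names are hypothetical, and the normalisation scheme described in the report itself may differ.

```python
from collections import Counter, defaultdict
from math import log

def estimate_action_probs(training_derivations):
    """Relative-frequency estimate of P(action | state).

    Each derivation is a list of (state, action) pairs recorded while
    a (hypothetical) supervised parse was being built.
    """
    counts = defaultdict(Counter)
    for derivation in training_derivations:
        for state, action in derivation:
            counts[state][action] += 1
    return {
        state: {a: n / sum(actions.values()) for a, n in actions.items()}
        for state, actions in counts.items()
    }

def log_score(derivation, probs, floor=1e-6):
    """Sum of log probabilities of the actions in one derivation.

    Unseen (state, action) pairs fall back to `floor`, an arbitrary
    smoothing constant chosen for this sketch.
    """
    return sum(
        log(probs.get(state, {}).get(action, floor))
        for state, action in derivation
    )

def rank(derivations, probs):
    """Order candidate analyses from most to least probable."""
    return sorted(derivations, key=lambda d: log_score(d, probs), reverse=True)

# Toy training data: in state 1, "reduce A" was chosen twice, "reduce B" once.
training = [
    [(0, "shift"), (1, "reduce A")],
    [(0, "shift"), (1, "reduce A")],
    [(0, "shift"), (1, "reduce B")],
]
probs = estimate_action_probs(training)

# Two competing analyses of the same input; the second should rank first.
candidates = [
    [(0, "shift"), (1, "reduce B")],
    [(0, "shift"), (1, "reduce A")],
]
best = rank(candidates, probs)[0]
```

Because the probability of a reduction is conditioned on the parse state rather than on the rule alone, the same rule can receive different probabilities in different contexts, which is the sense in which such a model is more context dependent than a probabilistic context-free grammar.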

BibTeX record

@TechReport{UCAM-CL-TR-224,
  author =      {Briscoe, Ted and Carroll, John},
  title =       {{Generalised probabilistic LR parsing of natural language
                  (corpora) with unification-based grammars}},
  url =         {https://www.cl.cam.ac.uk/techreports/UCAM-CL-TR-224.pdf},
  institution = {University of Cambridge, Computer Laboratory},
  doi =         {10.48456/tr-224},
  number =      {UCAM-CL-TR-224}
}