Computer Laboratory

Technical reports

Subcategorization acquisition

Anna Korhonen

February 2002, 189 pages

This technical report is based on a dissertation submitted September 2001 by the author for the degree of Doctor of Philosophy to the University of Cambridge, Trinity Hall.

Abstract

Manual development of large subcategorized lexicons has proved difficult because predicates change behaviour between sublanguages, domains and over time. Yet access to a comprehensive subcategorization lexicon is vital for successful parsing capable of recovering predicate-argument relations, and probabilistic parsers would greatly benefit from accurate information concerning the relative likelihood of different subcategorization frames (SCFs) of a given predicate. Acquisition of subcategorization lexicons from textual corpora has recently become increasingly popular. Although this work has met with some success, the resulting lexicons indicate a need for greater accuracy. One significant source of error lies in the statistical filtering used for hypothesis selection, i.e. for removing noise from automatically acquired SCFs.

This thesis builds on earlier work in verbal subcategorization acquisition, taking as a starting point the problem with statistical filtering. Our investigation shows that statistical filters tend to work poorly because not only is the underlying distribution Zipfian, but there is also very little correlation between the conditional distribution of SCFs specific to a verb and the unconditional distribution regardless of the verb. More accurate back-off estimates are needed for SCF acquisition than those provided by the unconditional distribution.

We explore whether more accurate estimates could be obtained by basing them on linguistic verb classes. Experiments are reported which show that in terms of SCF distributions, individual verbs correlate more closely with syntactically similar verbs and even more closely with semantically similar verbs, than with all verbs in general. On the basis of this result, we suggest classifying verbs according to their semantic classes and obtaining back-off estimates specific to these classes.
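As a rough illustration of what comparing SCF distributions can look like, the sketch below computes a rank correlation between two distributions over frames. The choice of Spearman rank correlation, the function names, the frame labels and the toy probabilities are all assumptions made for this example; the abstract does not specify the measure or the data used in the thesis.

```python
from math import sqrt
from typing import Dict, List


def average_ranks(values: List[float]) -> List[float]:
    """Rank values from 1 (smallest) upwards, giving tied values their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(xs: List[float], ys: List[float]) -> float:
    """Pearson correlation; returns 0.0 when either input has no variance."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def scf_rank_correlation(p: Dict[str, float], q: Dict[str, float]) -> float:
    """Spearman correlation of two SCF distributions (an assumed measure).

    Frames missing from one distribution are treated as probability 0.
    """
    frames = sorted(set(p) | set(q))
    ranks_p = average_ranks([p.get(f, 0.0) for f in frames])
    ranks_q = average_ranks([q.get(f, 0.0) for f in frames])
    return pearson(ranks_p, ranks_q)


# Toy, invented distributions (not data from the thesis).
verb = {"NP": 0.6, "NP-PP": 0.3, "PP": 0.1}
verb_class = {"NP": 0.5, "NP-PP": 0.4, "PP": 0.1}
all_verbs = {"NP": 0.2, "NP-PP": 0.3, "PP": 0.5}

class_corr = scf_rank_correlation(verb, verb_class)
general_corr = scf_rank_correlation(verb, all_verbs)
```

In this toy setting the verb ranks its frames in the same order as its class and in the opposite order to the general distribution, so `class_corr` exceeds `general_corr`.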

We propose a method for obtaining such semantically based back-off estimates, and a novel approach to hypothesis selection which makes use of these estimates. This approach involves four steps: automatically identifying the semantic class of a predicate; using subcategorization acquisition machinery to hypothesise a conditional SCF distribution for the predicate; smoothing the conditional distribution with the back-off estimates of the respective semantic verb class; and filtering with a simple threshold on the smoothed estimates. Adopting Briscoe and Carroll’s (1997) system as a framework, we demonstrate that this semantically driven approach to hypothesis selection can significantly improve the accuracy of large-scale subcategorization acquisition.
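The smoothing and filtering steps described above might be sketched as follows. This is a minimal illustration under stated assumptions, not the method of the thesis: linear interpolation is assumed as the smoothing scheme, and the function names, the interpolation weight, the threshold value, the frame labels and the toy probabilities are all hypothetical.

```python
from typing import Dict, Set


def smooth_with_backoff(
    conditional: Dict[str, float],
    backoff: Dict[str, float],
    weight: float = 0.5,
) -> Dict[str, float]:
    """Mix a verb's hypothesised SCF distribution with class back-off estimates.

    Linear interpolation is an assumption made for this sketch. `weight` is
    the share given to the verb-specific (conditional) distribution.
    """
    frames = set(conditional) | set(backoff)
    mixed = {
        f: weight * conditional.get(f, 0.0) + (1.0 - weight) * backoff.get(f, 0.0)
        for f in frames
    }
    total = sum(mixed.values())
    # Renormalise so that the result is a probability distribution.
    return {f: p / total for f, p in mixed.items()} if total > 0 else mixed


def filter_scfs(smoothed: Dict[str, float], threshold: float = 0.04) -> Set[str]:
    """Accept the frames whose smoothed probability reaches the threshold."""
    return {f for f, p in smoothed.items() if p >= threshold}


# Toy, invented input: "SCOMP" stands for a noisy hypothesis, and "PP" for a
# frame that was unseen for the verb but is attested for its semantic class.
hypothesised = {"NP": 0.70, "NP-PP": 0.26, "SCOMP": 0.04}
class_backoff = {"NP": 0.50, "NP-PP": 0.40, "PP": 0.10}

smoothed = smooth_with_backoff(hypothesised, class_backoff)
accepted = filter_scfs(smoothed)
```

With these toy numbers the class estimates pull the noisy frame below the threshold and lift the unseen but class-typical frame above it, so `accepted` contains `NP`, `NP-PP` and `PP`.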

Full text

PDF (1.1 MB)

BibTeX record

@TechReport{UCAM-CL-TR-530,
  author =	 {Korhonen, Anna},
  title = 	 {{Subcategorization acquisition}},
  year = 	 2002,
  month = 	 feb,
  url = 	 {http://www.cl.cam.ac.uk/techreports/UCAM-CL-TR-530.pdf},
  institution =  {University of Cambridge, Computer Laboratory},
  number = 	 {UCAM-CL-TR-530}
}