Computer Laboratory

Technical reports

Word sense selection in texts: an integrated model

Oi Yee Kwong

September 2000, 177 pages

This technical report is based on a dissertation submitted May 2000 by the author for the degree of Doctor of Philosophy to the University of Cambridge, Downing College.

Abstract

Early systems for word sense disambiguation (WSD) often depended on individual tailor-made lexical resources, hand-coded with as much lexical information as needed, but of severely limited vocabulary size. Recent studies tend to extract lexical information from a variety of existing resources (e.g. machine-readable dictionaries, corpora) for broad coverage. However, this raises the issue of how to combine the information from different resources.

Thus, while different types of resource could make different contributions to WSD, studies to date have not shown what contribution they make, how they should be combined, and whether they are equally relevant to all words to be disambiguated. This thesis proposes an Integrated Model as a framework to study the inter-relatedness of three major parameters in WSD: Lexical Resource, Contextual Information, and Nature of Target Words. We argue that it is their interaction which shapes the effectiveness of any WSD system.

A generalised, structurally-based sense-mapping algorithm was designed to combine various types of lexical resource. This enables information from these resources to be used simultaneously and compatibly, while respecting their distinctive structures. In studying the effect of context on WSD, different semantic relations available from the combined resources were used, and a recursive filtering algorithm was designed to overcome combinatorial explosion. We then investigated, from two directions, how the target words themselves could affect the usefulness of different types of knowledge. In particular, we modelled WSD with the cloze test format, i.e. as texts with blanks and all senses for one specific word as alternative choices for filling the blank.
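The cloze-test framing can be illustrated with a minimal sketch. This is not the algorithm developed in the thesis: the `ClozeItem` structure, the `choose_sense` function, the sense labels and the related-word sets are all hypothetical, and the scoring is a plain word-overlap count used here only to show how a blank with alternative senses might be represented and resolved.

```python
from dataclasses import dataclass


@dataclass
class ClozeItem:
    """A text with one blank and the candidate senses that could fill it."""
    context: list[str]            # words surrounding the blank
    choices: dict[str, set[str]]  # sense label -> related words (hypothetical data)


def choose_sense(item: ClozeItem) -> str:
    """Pick the sense whose related words overlap most with the context.

    Assumption: simple overlap counting stands in for the richer
    semantic relations drawn from the combined resources.
    """
    context = {w.lower() for w in item.context}
    return max(item.choices, key=lambda s: len(item.choices[s] & context))


# Illustrative item: the blank stands for the ambiguous noun "bank".
item = ClozeItem(
    context="she deposited the cheque at the ___ before noon".split(),
    choices={
        "bank/financial": {"money", "cheque", "deposit", "deposited", "loan"},
        "bank/river": {"river", "water", "shore", "slope"},
    },
)
chosen = choose_sense(item)
print(chosen)
```

In this toy item the context shares words only with the financial sense, so that sense is selected. A real system would draw the related-word sets from lexical resources rather than hand-written lists.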

A full-scale combination of WordNet and Roget’s Thesaurus was carried out, linking more than 30,000 senses. Using these two resources in combination, a range of disambiguation tests was run on more than 60,000 noun instances from corpus texts of different types, and 60 blanks from real cloze texts. Results show that combining resources is useful for enriching lexical information, and hence for making WSD more effective, though not completely so. Also, different target words make different demands on contextual information, and this interaction is closely related to text types. Future work is suggested for extending the analysis of the nature of target words and making the combination of disambiguation evidence sensitive to the requirements of the word being disambiguated.

Full text

PS (0.6 MB)

BibTeX record

@TechReport{UCAM-CL-TR-504,
  author =	 {Kwong, Oi Yee},
  title = 	 {{Word sense selection in texts: an integrated model}},
  year = 	 2000,
  month = 	 sep,
  url = 	 {http://www.cl.cam.ac.uk/techreports/UCAM-CL-TR-504.ps.gz},
  institution =  {University of Cambridge, Computer Laboratory},
  number = 	 {UCAM-CL-TR-504}
}