Computer Laboratory

Technical reports

Probabilistic word sense disambiguation
Analysis and techniques for combining knowledge sources

Judita Preiss

August 2006, 108 pages

This technical report is based on a dissertation submitted July 2005 by the author for the degree of Doctor of Philosophy to the University of Cambridge, Trinity College.


This thesis shows that probabilistic word sense disambiguation systems based on established statistical methods are strong competitors to current state-of-the-art word sense disambiguation (WSD) systems.

We begin with a survey of approaches to WSD, and examine their performance in the systems submitted to the SENSEVAL-2 WSD evaluation exercise. We discuss existing resources for WSD, and investigate the amount of training data needed for effective supervised WSD.

We then present the design of a new probabilistic WSD system. The main feature of the design is that it combines multiple probabilistic modules using both Dempster-Shafer theory and Bayes' rule. Additionally, the use of Lidstone’s smoothing provides a uniform mechanism for weighting modules based on their accuracy, removing the need for a separate weighting scheme.
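As a rough illustration of the ingredients named above, the following Python sketch shows Lidstone smoothing of sense counts, Dempster's rule of combination, and a Bayes-style normalized product of two module outputs. It is not the system described in the report: the function names, the sense labels, the value of lam, and the independence and uniform-prior assumptions in bayes_combine are all hypothetical choices made for illustration.

```python
from itertools import product


def lidstone(counts, senses, lam=0.5):
    """Lidstone-smoothed distribution over `senses`.

    P(s) = (count(s) + lam) / (N + lam * |senses|).
    `lam` is an illustrative value, not one taken from the report.
    """
    total = sum(counts.get(s, 0) for s in senses)
    denom = total + lam * len(senses)
    return {s: (counts.get(s, 0) + lam) / denom for s in senses}


def dempster_combine(m1, m2):
    """Dempster's rule of combination.

    m1, m2 map frozensets of senses to masses that each sum to 1.
    Mass assigned to the full sense set represents ignorance.
    """
    combined = {}
    conflict = 0.0
    for (a, x), (b, y) in product(m1.items(), m2.items()):
        inter = a & b
        if inter:
            combined[inter] = combined.get(inter, 0.0) + x * y
        else:
            conflict += x * y  # mass falling on the empty set
    if conflict >= 1.0:
        raise ValueError("sources are in total conflict")
    return {k: v / (1.0 - conflict) for k, v in combined.items()}


def bayes_combine(p1, p2):
    """Normalized product of two distributions over the same senses.

    Assumes (for this sketch only) conditionally independent modules
    and a uniform prior over senses.
    """
    prod = {s: p1[s] * p2[s] for s in p1}
    z = sum(prod.values())
    return {s: v / z for s, v in prod.items()}


if __name__ == "__main__":
    senses = ["bank/river", "bank/money"]  # hypothetical sense labels
    p1 = lidstone({"bank/river": 3, "bank/money": 1}, senses, lam=1.0)
    p2 = lidstone({"bank/money": 2}, senses, lam=1.0)
    print(bayes_combine(p1, p2))
```

Note that a larger lam pulls a module's distribution towards uniform, so that module has less influence on the combined result. This is one plausible reading of how smoothing can act as a weighting mechanism, not a description of the report's exact scheme.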

Lastly, we evaluate our probabilistic WSD system using traditional evaluation methods, and introduce a novel task-based approach. When evaluated on the gold standard used in the SENSEVAL-2 competition, the performance of our system lies between that of the first- and second-ranked WSD systems submitted to the English all-words task.

Task-based evaluations are becoming more popular in natural language processing because they provide an absolute measure of a system’s performance on a given task. We present a new evaluation method based on subcategorization frame (SCF) acquisition. Experiments with our probabilistic WSD system show an extremely high correlation between SCF acquisition performance and WSD performance, demonstrating the suitability of SCF acquisition as a WSD evaluation task.
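The abstract does not say which correlation statistic was used. As a minimal sketch, assuming a Pearson coefficient over paired scores (one WSD score and one SCF acquisition score per system configuration), the computation could look like the following. The function name and the pairing of scores are assumptions made for illustration, and no figures from the report are reproduced here.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)
```

A value close to 1 would indicate that configurations with better WSD accuracy also tend to acquire subcategorization frames more accurately.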

BibTeX record

  author =	 {Preiss, Judita},
  title = 	 {{Probabilistic word sense disambiguation : Analysis and
         	   techniques for combining knowledge sources}},
  year = 	 2006,
  month = 	 aug,
  url = 	 {},
  institution =  {University of Cambridge, Computer Laboratory},
  number = 	 {UCAM-CL-TR-673}