Computer Laboratory

Technical reports

Statistical anaphora resolution in biomedical texts

Caroline V. Gasperin

December 2009, 124 pages

This technical report is based on a dissertation submitted August 2008 by the author for the degree of Doctor of Philosophy to the University of Cambridge, Clare Hall.

Abstract

This thesis presents a study of anaphora in biomedical scientific literature and focuses on tackling the problem of anaphora resolution in this domain. Biomedical literature has been the focus of many information extraction projects; there is, however, very little work on anaphora resolution in full-text biomedical scientific articles. Resolving anaphora is an important step in the identification of mentions of biomedical entities about which information could be extracted.

We have identified coreferent and associative anaphoric relations in biomedical texts. Among associative relations we were able to distinguish 3 main types: biotype, homolog and set-member relations. We have created a corpus of biomedical articles that are annotated with anaphoric links between noun phrases referring to biomedical entities of interest. Such noun phrases are typed according to a scheme that we have developed based on the Sequence Ontology; it distinguishes 7 types of entities: gene, part of gene, product of gene, part of product, subtype of gene, supertype of gene and gene variant.

We propose a probabilistic model for the resolution of anaphora in biomedical texts. The model seeks to find the antecedents of anaphoric expressions, both coreferent and associative, and also to identify discourse-new expressions. The model achieves good performance despite being trained on a small corpus: 55-73% precision and 57-63% recall on coreferent cases, and reasonable performance on different classes of associative cases. We compare the performance of the model with a rule-based baseline system that we have also developed, a naive Bayes system and a decision-tree system, showing that ours outperforms the others.
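As a reading aid for the precision and recall figures above, the following is a minimal sketch of how such scores can be computed over anaphor-antecedent links. It is not the evaluation procedure of the thesis: the data layout (a mapping from anaphor identifiers to antecedent identifiers, with `None` marking a discourse-new expression), the identifiers and the function name `precision_recall` are all assumptions made for illustration, and the thesis may score links differently, for example by accepting any antecedent in the correct coreference chain.

```python
from typing import Dict, Optional, Tuple

# Hypothetical data layout (an assumption, not taken from the thesis):
# each anaphor id maps to its antecedent id, or to None when the
# expression is treated as discourse-new.
Links = Dict[str, Optional[str]]


def precision_recall(gold: Links, predicted: Links) -> Tuple[float, float]:
    """Precision and recall over resolved (non-None) anaphor-antecedent links."""
    gold_links = {(a, ant) for a, ant in gold.items() if ant is not None}
    pred_links = {(a, ant) for a, ant in predicted.items() if ant is not None}
    correct = len(gold_links & pred_links)
    # Guard against empty sets so the function never divides by zero.
    precision = correct / len(pred_links) if pred_links else 0.0
    recall = correct / len(gold_links) if gold_links else 0.0
    return precision, recall


# Toy example with made-up noun-phrase identifiers.
gold = {"np5": "np2", "np9": "np2", "np12": "np7", "np15": None}
predicted = {"np5": "np2", "np9": "np4", "np12": None, "np15": "np7"}
p, r = precision_recall(gold, predicted)
print(f"precision={p:.2f} recall={r:.2f}")
```

In this toy example only one of the three predicted links matches a gold link, and only one of the three gold links is recovered, so both scores are one third.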

We have experimented with active learning in order to select training samples to improve the performance of our probabilistic model. Active learning was not, however, more successful than random sampling.

Full text

PDF (1.8 MB)

BibTeX record

@TechReport{UCAM-CL-TR-764,
  author =	 {Gasperin, Caroline V.},
  title = 	 {{Statistical anaphora resolution in biomedical texts}},
  year = 	 2009,
  month = 	 dec,
  url = 	 {http://www.cl.cam.ac.uk/techreports/UCAM-CL-TR-764.pdf},
  institution =  {University of Cambridge, Computer Laboratory},
  number = 	 {UCAM-CL-TR-764}
}