This project tests different models of coherence for finding regions of coherent text in scientific discourse, in particular two types of coherent regions:
This project would suit a student who is interested in algorithms (e.g., the lexical chain algorithm), and who likes data work (e.g., looking through dozens of citation blocks, deciding where they start and end). The student should have good intuition about writing style in science, and be able to generalise over similarities in writing style. Programming language of choice.
Barzilay, R. and Elhadad, M. Using Lexical Chains for Text Summarization. 1997 Summarization Workshop.
Hearst, M. Text Tiling. Computational Linguistics. 1997.
Silber and McCoy, An algorithm for lexical chains, Computational Linguistics, 2002.
Barzilay and Lapata, Entity-based coherence, ACL 2005.
This project concerns anaphora resolution in scientific text, with a particular focus on pronouns and demonstrative definite NPs. The project consists of the following stages:
This project will probably require some manual annotation from the student. Definite NP reference is particularly difficult in scientific discourse and will not be addressed unless particularly fast progress is achieved. A comparison to a baseline algorithm such as LingPipe coreference (which is not specialised to scientific literature) should be performed.
An ideal student for this project would be goal-driven, as several subtasks need to be solved and individually evaluated, and would have an interest in linguistics. Programming language of choice.
Kim and Webber (2006). Automatic reference resolution in astronomy articles. International CODATA Conference, Beijing.
Ge et al. (1998). A statistical approach to anaphora resolution. In Proceedings of the Sixth Workshop on Very Large Corpora, COLING-ACL’98, Montreal, Canada.
Kaplan and Tokunaga (2009). A citation-based approach to summarisation. NLPIR4DL, ACL workshop, Singapore.
Sentence-based abstract-document alignment has great benefits for summarisation, as sentences in the abstract are often picked from the rest of the document. How much such sentence pairs differ depends on the individual writing style, but overall few changes are observed. The goal of this project is to explore different methods for computing this alignment. A small gold standard corpus of 80 aligned abstract-document pairs exists, which could be expanded with very little work from the student if necessary.
In general, it is hard to find good semantic similarity metrics for linguistic objects as short as sentences. Both the longest common substring algorithm (Teufel and Moens 1997) and vector space comparison (e.g., as in Marcu 1999) are known to perform rather weakly; such simple algorithms can, however, be used as baselines in this project.
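As a rough illustration of what such baselines might look like, the sketch below scores sentence pairs in two ways: by the longest common run of tokens, and by the cosine of raw term-count vectors. It is not the method of either cited paper. The tokenisation, the choice of tokens rather than characters as the unit, the absence of term weighting, and all function names (`tokenize`, `longest_common_substring`, `cosine`, `align`) are assumptions made for this example.

```python
import math
import re
from collections import Counter


def tokenize(sentence):
    """Lowercase a sentence and split it into alphanumeric tokens (assumed scheme)."""
    return re.findall(r"[a-z0-9]+", sentence.lower())


def longest_common_substring(a, b):
    """Length of the longest contiguous token run shared by token lists a and b."""
    best = 0
    prev = [0] * (len(b) + 1)
    for tok_a in a:
        curr = [0] * (len(b) + 1)
        for j, tok_b in enumerate(b, start=1):
            if tok_a == tok_b:
                # Extend the run that ended at the previous token of both lists.
                curr[j] = prev[j - 1] + 1
                best = max(best, curr[j])
        prev = curr
    return best


def cosine(a, b):
    """Cosine similarity between raw term-count vectors of token lists a and b."""
    ca, cb = Counter(a), Counter(b)
    dot = sum(ca[t] * cb[t] for t in ca.keys() & cb.keys())
    norm_a = math.sqrt(sum(v * v for v in ca.values()))
    norm_b = math.sqrt(sum(v * v for v in cb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def align(abstract_sentences, document_sentences, score=cosine):
    """For each abstract sentence, return (index of best document sentence, score).

    Assumes document_sentences is non-empty; ties go to the earliest sentence.
    """
    doc_tokens = [tokenize(s) for s in document_sentences]
    result = []
    for sentence in abstract_sentences:
        tokens = tokenize(sentence)
        scores = [score(tokens, d) for d in doc_tokens]
        best = max(range(len(scores)), key=scores.__getitem__)
        result.append((best, scores[best]))
    return result
```

Calling `align(abstract, document, score=longest_common_substring)` switches the baseline. Both scores rely on surface word overlap only, which is one plausible reason such measures struggle once the abstract sentence has been paraphrased.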
There are two flavours of this project. The first is an exploration, as close to exhaustive as possible, of different variables in semantic spaces for this task: syntactic vs. keyword-based, and dimensionality reduction methods such as LSI. The second option is to implement only ONE promising semantic space, and to combine this measure of semantic similarity with two additional facts about abstract-document alignment:
The student best suited for this project should be a competent programmer and very systematic (for option 1), or mathematically interested and have an interest in developing new algorithms (for option 2).
Teufel, S. and Moens, M. (1997). A gold standard for abstract-document alignment. ACL-workshop on Automatic Summarization.
Marcu (1999). The automatic construction of large-scale corpora for summarization research. SIGIR-99.
A Comparison of Semantic Spaces for Abstract-Document Alignment
Description
Option 2 turns the best choice of alignment into a constraint satisfaction problem.
References: