Coherence in text -- the property of text to "glue together" in some sense -- is known to be based on lexical, syntactic, and discourse-based phenomena. This project tests different models of coherence to find regions of coherent text in news articles and, if time permits, in scientific discourse. It will test three core methods against each other:
Evaluation: following previous work, a "cheap" gold standard that uses layout information to infer breaks in coherence could be used. I would like to work with the student on a more convincing definition of truth, which should, however, still be determinable at least semi-automatically; i.e., this project does not necessarily require human judgement for evaluation.
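As an illustration of the "cheap" layout-based gold standard, the following minimal sketch treats paragraph breaks as gold coherence breaks and scores a set of predicted breaks against them. The function names (`layout_boundaries`, `boundary_prf`), the blank-line paragraph convention, and the naive sentence splitter are assumptions made for this example only; exact-match precision and recall is just one possible scoring scheme.

```python
import re

def layout_boundaries(text):
    """Return (sentences, gold), where gold holds every index i such that
    a paragraph break follows sentence i.
    Assumption: paragraphs are separated by blank lines."""
    sentences, gold = [], set()
    paragraphs = [p for p in re.split(r"\n\s*\n", text) if p.strip()]
    for p in paragraphs:
        # Naive splitter (assumption): a sentence ends at ., ! or ? plus whitespace.
        sents = [s.strip() for s in re.split(r"(?<=[.!?])\s+", p.strip()) if s.strip()]
        sentences.extend(sents)
        gold.add(len(sentences) - 1)
    # The end of the document is not counted as a break.
    gold.discard(len(sentences) - 1)
    return sentences, gold

def boundary_prf(predicted, gold):
    """Exact-match precision, recall and F1 between two sets of break indices."""
    tp = len(predicted & gold)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    return precision, recall, f1

if __name__ == "__main__":
    text = "A one. A two.\n\nB one.\n\nC one. C two. C three."
    sentences, gold = layout_boundaries(text)
    print(sentences, gold)
    print(boundary_prf({1, 3}, gold))
```

Exact matching penalises near misses as heavily as gross errors, so a more convincing definition of truth would probably need a more tolerant measure.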
This project would suit a student who has good intuition about writing style and is interested in algorithms (e.g., the lexical chain algorithm). It is an ambitious and work-heavy project of medium novelty (lots of implementation; programming language of choice). It is relatively low-risk, but if successful would be of interest to the general NLP community.
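To give a flavour of the kind of algorithm involved, here is a deliberately simplified sketch of a lexical-cohesion approach in the spirit of Hearst's text tiling: the words on either side of each sentence gap are compared, and a break is proposed where the vocabulary overlap is low. It is not the published algorithm (there is no stop-word removal, smoothing, or depth scoring), and the names `gap_similarities` and `candidate_breaks`, the block size, and the threshold are all assumptions chosen for illustration.

```python
import math
import re
from collections import Counter

def tokenize(sentence):
    """Lower-case alphabetic tokens; no stemming or stop-word removal."""
    return re.findall(r"[a-z]+", sentence.lower())

def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def gap_similarities(sentences, block_size=2):
    """sims[i] is the similarity between the block of sentences ending at
    sentence i and the block starting at sentence i + 1."""
    sims = []
    for gap in range(1, len(sentences)):
        left = Counter(w for s in sentences[max(0, gap - block_size):gap] for w in tokenize(s))
        right = Counter(w for s in sentences[gap:gap + block_size] for w in tokenize(s))
        sims.append(cosine(left, right))
    return sims

def candidate_breaks(sims, threshold=0.1):
    """Propose a break after sentence i wherever similarity falls below
    the (hypothetical) threshold."""
    return {i for i, s in enumerate(sims) if s < threshold}

if __name__ == "__main__":
    sentences = [
        "The cat sat on the mat.",
        "The cat chased the mouse.",
        "Stock markets fell sharply.",
        "Markets fell again today.",
    ]
    sims = gap_similarities(sentences)
    print(sims)
    print(candidate_breaks(sims))
```

Lexical chains and entity-based models replace the raw word overlap used here with chains of semantically related words and with patterns of entity mentions, respectively.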
Hearst, M. Text Tiling. Computational Linguistics. 1997.
Silber and McCoy, An algorithm for lexical chains, Computational Linguistics, 2002.
Barzilay and Lapata, Entity-based coherence, CL 2008.
Testuggine, Finding citation blocks by entity-based coherence, ACS thesis 2012.
Barzilay, R. and Elhadad, M. Using Lexical Chains for Text Summarization. 1997 Summarization Workshop.
Plots in narrative texts are based on a) the main participants in the story, b) the main events in the story, and c) the time line. Textual coherence is of importance in the analysis of plots as well. The aim of this project is to build a model of the three main plot components in short stories using named entity recognition, a parser, and a model of lexical coherence. Once a), b) and/or c) are detected, it is possible to generate different kinds of summaries, depending on whether we want to summarise the story for somebody who is yet to read the text (in which case we do not want to give away the plot, but give a feeling for the setting of the text), or for somebody who does not intend to read the text but wants a summary for other purposes (in which case we want to summarise the main points).
I propose to use older short stories of foreign origin, where the copyright has lapsed, and where there is often more than one translation. This will allow us to test our models of lexical cohesion. Knowing that both translations contain basically the same information will help pinpoint problems with the detection of coherence.
Another advantage of using well-known older novels is that multiple human-written summaries (both "blurbs" and cliff-note-type informative summaries) exist for these texts, which will allow the student to use the ROUGE methodology as a cheap evaluation. A more fine-grained and informative evaluation would include the elicitation of human judgements, but this is optional in this project.
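For orientation, the core of the ROUGE methodology is n-gram recall of a system summary against one or more human-written reference summaries. The sketch below is a minimal reimplementation of that idea under simplifying assumptions (lower-casing, alphanumeric tokens, no stemming or stop-word handling); the function names are made up for this example, and the official ROUGE software should be used for any reported results.

```python
import re
from collections import Counter

def ngrams(text, n):
    """Count the n-grams of a text after naive lower-case tokenisation."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def rouge_n_recall(candidate, references, n=1):
    """Clipped n-gram matches between the candidate and each reference,
    divided by the total number of n-grams in the references."""
    cand = ngrams(candidate, n)
    matched = total = 0
    for ref in references:
        ref_counts = ngrams(ref, n)
        total += sum(ref_counts.values())
        # Clip each match at the number of times the candidate contains the n-gram.
        matched += sum(min(count, cand[gram]) for gram, count in ref_counts.items())
    return matched / total if total else 0.0

if __name__ == "__main__":
    candidate = "the cat sat on the mat"
    references = ["the cat was on the mat"]
    print(rouge_n_recall(candidate, references, n=1))
    print(rouge_n_recall(candidate, references, n=2))
```

With several reference summaries per story, as is often available for well-known texts, the same function accepts them all in the `references` list.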
This highly innovative project is a feasibility study in whether current advances in lexical semantics and parsing are enough to automate the detection of plot in longer texts, and possibly to use this information for summarisation. Based on whether the student wishes to concentrate on a), b) or c), the task can be attacked in several ways:
This project is very ambitious, and could lead to a better understanding of the discourse structure of narrative texts. However, the existence of a simple model based on the minimal implementation of a)-c) as a fall-back makes this project less risky than it might initially look: even the successful implementation of the simple model would be extremely novel, and thus interesting to test against the state of the art.
Due to the speculative nature of the project, the student choosing it should be a competent programmer and problem solver who can work on a task independently. Some annotation/data preparation may have to be performed.
Lehnert (1981); Plot units and narrative summarization. Cognitive Science
Mani, Pustejovsky (2004): Temporal discourse models for narrative structure. In: Proceedings of the 2004 ACL Workshop on Discourse Annotation
Jaynes, Golden (2003): Statistical Detection of Local Coherence Relations in Narrative Recall and Summarization Data. In: Proceedings of the 25th Annual Conference of the Cognitive Science Society.
Nakhimovsky (1988): Aspect, aspectual class, and the temporal structure of narrative. Computational Linguistics, 1988.
Kazantseva and Szpakowicz (2010): Summarizing Short Stories. Computational Linguistics 36(1): 71-109
Lin and Hovy (2003). Automatic Evaluation of Summaries Using N-gram Co-occurrence Statistics. In: Proceedings of 2003 Language Technology Conference (HLT-NAACL 2003). ROUGE SW
Boguraev and Kennedy (1999). Salience-based content characterisation of text documents. In: Advances in Automatic Text Summarization, pp. 2--9.
Gutenberg; short stories