Project 1: Modelling Lexical Coherence

  • Proposer: Simone Teufel
  • Supervisor: Simone Teufel
  • Special Resources: None

    Description

    Coherence in text -- the property of a text to "glue together" in some sense -- is known to arise from lexical, syntactic, and discourse-based phenomena. This project tests different models of coherence to find regions of coherent text in news articles and, if time permits, in scientific discourse. It will test three core methods against each other: lexical similarity in the style of TextTiling (Hearst, 1997), lexical chains (Silber and McCoy, 2002), and entity-based coherence (Barzilay and Lapata, 2008).
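    As a hedged illustration of the lexical-similarity family of methods (in the spirit of TextTiling), the sketch below computes the cosine similarity of the word windows on either side of each sentence gap; low-similarity gaps are candidate coherence breaks. The window size and tokenisation are illustrative choices, not prescribed by the proposal.

```python
import re
from collections import Counter
from math import sqrt

def tokens(sentence):
    """Crude tokeniser: lowercase alphabetic runs."""
    return re.findall(r"[a-z]+", sentence.lower())

def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    num = sum(a[w] * b[w] for w in a)
    den = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return num / den if den else 0.0

def gap_similarities(sentences, window=1):
    """Similarity of the token windows before and after each sentence gap;
    local minima are candidate coherence breaks."""
    sims = []
    for gap in range(window, len(sentences) - window + 1):
        left = Counter(t for s in sentences[gap - window:gap] for t in tokens(s))
        right = Counter(t for s in sentences[gap:gap + window] for t in tokens(s))
        sims.append((gap, cosine(left, right)))
    return sims
```

    On a toy document whose topic shifts midway, the lowest-similarity gap falls at the shift; a real system would smooth the similarity curve and use depth scores, as Hearst describes.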

    A very similar project two years ago (Testuggine, 2012) found the entity-based model to outperform lexical similarity on the task of finding related-work sections in text. This project would build upon that work by moving to a "simpler", more clear-cut domain (newspaper text). Intellectually, the largest impact of this project lies in devising a method for scoring, rather than ranking, coherence.
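    The entity-based model can be sketched minimally in the spirit of Barzilay and Lapata's entity grid. The simplifications below are assumptions for illustration: grammatical roles are reduced to present/absent, and per-sentence entity sets are taken as given rather than extracted. The score is the fraction of adjacent-sentence transitions in which an entity persists, which notably yields an absolute score rather than a ranking.

```python
def entity_grid(sentence_entities):
    """Rows = sentences, columns = entities, cells = 1 if mentioned."""
    entities = sorted({e for s in sentence_entities for e in s})
    return [[1 if e in s else 0 for e in entities] for s in sentence_entities]

def coherence_score(sentence_entities):
    """Fraction of adjacent-sentence entity transitions in which the
    entity stays mentioned -- a crude proxy for grid-based coherence."""
    grid = entity_grid(sentence_entities)
    if len(grid) < 2 or not grid[0]:
        return 0.0
    cont = total = 0
    for row, nxt in zip(grid, grid[1:]):
        for a, b in zip(row, nxt):
            total += 1
            if a == 1 and b == 1:
                cont += 1
    return cont / total
```

    A text that keeps mentioning the same entity scores 1.0; a text whose sentences share no entities scores 0.0.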

    Evaluation: following previous work, a "cheap" gold standard could be used in which layout information is taken to indicate breaks in coherence. I would like to work with the student on a more convincing definition of ground truth, which should nevertheless still be determinable at least semi-automatically; i.e., this project does not necessarily require human judgement for evaluation.
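    If layout-derived breaks serve as gold segment boundaries, one standard way to score predicted breaks against them is the WindowDiff metric (Pevzner and Hearst). A minimal sketch, with boundary sequences encoded as 0/1 per sentence gap; the default window size is a conventional choice, not part of the proposal:

```python
def window_diff(gold, pred, k=None):
    """WindowDiff: slide a window of size k over both boundary sequences
    and count positions where the number of boundaries disagrees."""
    assert len(gold) == len(pred)
    if k is None:
        # conventional choice: half the mean gold segment length
        k = max(1, round(len(gold) / (2 * max(1, sum(gold)))))
    spans = len(gold) - k + 1
    errors = sum(1 for i in range(spans)
                 if sum(gold[i:i + k]) != sum(pred[i:i + k]))
    return errors / spans
```

    Identical segmentations score 0.0; unlike exact boundary matching, near-miss boundaries are penalised less than distant ones.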

    This project would suit a student who has good intuition about writing style and is interested in algorithms (e.g., the lexical chain algorithm). It is an ambitious and work-heavy project of medium novelty (lots of implementation; programming language of choice). It is relatively low-risk, but if successful would be of interest to the general NLP community.

    References:

    Hearst, M. (1997). TextTiling: Segmenting Text into Multi-paragraph Subtopic Passages. Computational Linguistics.

    Silber and McCoy (2002). Efficiently Computed Lexical Chains as an Intermediate Representation for Automatic Text Summarization. Computational Linguistics.

    Barzilay, R. and Lapata, M. (2008). Modeling Local Coherence: An Entity-Based Approach. Computational Linguistics.

    Testuggine (2012). Finding Citation Blocks by Entity-Based Coherence. ACS thesis.

    Barzilay, R. and Elhadad, M. (1997). Using Lexical Chains for Text Summarization. 1997 Summarization Workshop.

    Project 2: Discovering Plots in Narrative Texts

  • Proposer: Simone Teufel
  • Supervisor: Simone Teufel
  • Special Resources: None

    Description

    Plots in narrative texts are based on a) the main participants in the story, b) the main events in the story, and c) the time line. Textual coherence is also important in the analysis of plots. The aim of this project is to build a model of the three main plot components in short stories using named entity recognition, a parser, and a model of lexical coherence. Once a), b) and/or c) are detected, it becomes possible to generate different kinds of summaries, depending on the intended reader: somebody who is yet to read the text (in which case we do not want to give away the plot, but rather convey a feeling for the setting of the text), or somebody who does not intend to read the text but wants a summary for other purposes (in which case we want to summarise the main points).
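    As a toy illustration of component a), the sketch below ranks candidate main participants by mention frequency. A capitalisation heuristic stands in for a real named entity recogniser, purely for illustration; a proper system would use an NER tool and coreference information.

```python
import re
from collections import Counter

def main_participants(text, top_n=3):
    """Rank candidate participants by how often they are mentioned,
    using non-sentence-initial capitalised words as a toy NER stand-in."""
    candidates = Counter()
    for sentence in re.split(r"[.!?]\s+", text):
        words = sentence.split()
        for raw in words[1:]:  # skip sentence-initial capitalisation
            w = raw.strip(".,;:!?\"'")
            if w[:1].isupper() and w.isalpha():
                candidates[w] += 1
    return [name for name, _ in candidates.most_common(top_n)]
```

    Frequency alone is crude; salience models such as Boguraev and Kennedy's (see references) weight mentions by syntactic prominence as well.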

    I propose to use older short stories of foreign origin, where the copyright has lapsed and where there is often more than one translation. This will allow us to test our models of lexical cohesion: knowing that both translations contain essentially the same information will help pinpoint problems with the detection of coherence.

    Another advantage of using well-known older texts is that multiple human-written summaries (both "blurbs" and CliffsNotes-style informative summaries) exist for them, which will allow the student to use the ROUGE methodology as a cheap evaluation. A more fine-grained and informative evaluation would include the elicitation of human judgements, but this is optional in this project.
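    The core of the ROUGE methodology can be sketched as ROUGE-1 recall: the fraction of unigrams in a human reference summary that also appear in the system summary. The actual ROUGE package (Lin and Hovy, see references) adds longer n-grams, stemming, and multiple references; this is only the basic idea.

```python
from collections import Counter

def rouge_1_recall(system, reference):
    """Clipped unigram overlap, divided by the reference length."""
    sys_counts = Counter(system.lower().split())
    ref_counts = Counter(reference.lower().split())
    overlap = sum(min(sys_counts[w], c) for w, c in ref_counts.items())
    total = sum(ref_counts.values())
    return overlap / total if total else 0.0
```

    A system summary identical to the reference scores 1.0; one sharing a third of the reference's words scores about 0.33.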

    This highly innovative project is a feasibility study of whether current advances in lexical semantics and parsing are sufficient to automate the detection of plot in longer texts, and possibly to use this information for summarisation. Depending on whether the student wishes to concentrate on a), b) or c), the task can be attacked in several ways.

    This project is very ambitious, and could lead to a better understanding of the discourse structure of narrative texts. However, the existence of a simple model based on a minimal implementation of a)-c) as a fall-back makes this project less risky than it might initially look: even a successful implementation of the simple model would be extremely novel, and thus interesting to test against the state of the art.

    Due to the speculative nature of the project, the student choosing it should be a competent programmer and problem solver who can work on a task independently. Some annotation/data preparation may have to be performed.

    References:

    Lehnert (1981). Plot Units and Narrative Summarization. Cognitive Science.

    Mani and Pustejovsky (2004). Temporal Discourse Models for Narrative Structure. In: Proceedings of the 2004 ACL Workshop on Discourse Annotation.

    Jaynes and Golden (2003). Statistical Detection of Local Coherence Relations in Narrative Recall and Summarization Data. In: Proceedings of the 25th Annual Conference of the Cognitive Science Society.

    Nakhimovsky (1988). Aspect, Aspectual Class, and the Temporal Structure of Narrative. Computational Linguistics.

    Kazantseva and Szpakowicz (2010). Summarizing Short Stories. Computational Linguistics 36(1): 71-109.

    Lin and Hovy (2003). Automatic Evaluation of Summaries Using N-gram Co-occurrence Statistics. In: Proceedings of HLT-NAACL 2003. (ROUGE software)

    Boguraev and Kennedy (1999). Salience-Based Content Characterisation of Text Documents. In: Advances in Automatic Text Summarization, pp. 2-9.

    Project Gutenberg (source of out-of-copyright short stories).

    Project 3: Cross-Discipline Determination of Rhetorically Charged Sentences in Scientific Writing

  • Proposer: Simone Teufel
  • Supervisor: Simone Teufel
  • Special Resources: None

    Description

    Detecting innovation in science across an entire field, by analysing the scientific literature automatically, is currently a hot research area. While most approaches rely on a statistical analysis of the words contained in the papers, this project uses parsing and machine learning to detect rhetorically charged sentences, and thereby uncovers innovations which are explicitly declared in the paper. In particular, the project concentrates on the detection of several types of rhetorically charged statements (possibly only a subset of them).

    Starting from a set of annotated sentences of these kinds in one domain (computational linguistics), and known indicator phrases for them, parsing and WordNet are to be used to detect similar statements in unseen text -- initially of the same domain, then of different domains (chemistry, biochemistry, engineering, computer science). Variations of the meta-discourse phrases will be found by generalising over parser output (``we are the first to present X'' = ``we present X'' = ``X is presented''). Evaluation might rely on a gold standard (created by the student themselves), or on human evaluation of the system output.

    The student choosing this project should have an interest in pragmatics and semantics and should not shy away from data analysis. Programming language of choice. This is a relatively "safe" project, where a student could achieve a basic system within a few weeks and then steadily improve it.
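    A hedged sketch of the surface-level starting point, matching known indicator phrases against sentences. The patterns below are illustrative inventions; the project itself would generalise such cues over parser output rather than rely on regular expressions.

```python
import re

# Illustrative indicator phrases for rhetorically charged statements;
# a real cue lexicon would come from the annotated sentence set.
INDICATORS = [
    r"\bwe are the first to\b",
    r"\bwe (present|propose|introduce)\b",
    r"\bin contrast to (previous|prior) work\b",
    r"\bto our knowledge\b",
]

def rhetorical_sentences(sentences):
    """Return the sentences that contain any known indicator phrase."""
    pattern = re.compile("|".join(INDICATORS), re.IGNORECASE)
    return [s for s in sentences if pattern.search(s)]
```

    Such surface matching is the baseline the parser-based generalisation is meant to improve on: it misses paraphrases like ``X is presented'' that share no indicator phrase.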

    References:

    Project 4: Lexical Chains Plus

  • Proposer: Simone Teufel
  • Supervisor: Simone Teufel
  • Special Resources: None

    Description

    This project compares "normal" lexical chains to a new form of lexical chains which incorporates anaphora resolution. Lexical chains are a construct in which related concepts are clustered into chains, each of which represents a topic. Each concept is expressed as a disambiguated sense and can occur on only one lexical chain. Algorithms for lexical chain creation exist; the best ones are linear in time (Silber and McCoy, 2002).

    This project requires the student to implement this algorithm, combine it with an anaphora resolver of choice (either their own, or an off-the-shelf one), and to use one of the many applications of lexical chains to evaluate whether this improves results. How the anaphora resolver's output is incorporated is up to the creativity of the student; a simple reweighting scheme for referred-to entities in each lexical chain could be tried first. Known applications for lexical chains are manifold.

    This is a rather straightforward project, which can be extended to suit the student's skills and interests. In its most basic form, it can be performed with standard summarisation evaluation of an extractive summariser, the ROUGE package, and an off-the-shelf anaphora resolver. A more ambitious version of this project would test the quality of the lexical chains intrinsically, possibly use a different application of lexical chains, and experiment with the student's own version of a known anaphora resolution algorithm.
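    A minimal sketch of greedy lexical chaining with a toy relatedness test (identity, or shared membership in a hand-written synonym set), plus the proposed simple reweighting: chains containing entities that an anaphora resolver links pronouns back to receive extra weight. The synonym sets and the `resolved_mentions` interface are assumptions for illustration, not the Silber and McCoy algorithm itself.

```python
# Toy stand-in for WordNet-style relatedness.
SYNSETS = [{"car", "automobile", "vehicle"}, {"driver", "motorist"}]

def related(a, b):
    return a == b or any(a in s and b in s for s in SYNSETS)

def build_chains(nouns, resolved_mentions=()):
    """Greedily assign each noun to the first related chain, so each
    concept occurs on only one chain; then reweight chains by the
    anaphora resolver's output."""
    chains = []  # each chain: {"members": [...], "weight": number}
    for noun in nouns:
        for chain in chains:
            if any(related(noun, m) for m in chain["members"]):
                chain["members"].append(noun)
                chain["weight"] += 1
                break
        else:
            chains.append({"members": [noun], "weight": 1})
    # reweighting: boost chains whose members pronouns were resolved to
    for chain in chains:
        chain["weight"] += sum(1 for m in resolved_mentions
                               if any(related(m, c) for c in chain["members"]))
    return chains
```

    The chain weights could then feed an extractive summariser (sentences covering heavy chains are extracted first), which is one of the evaluation routes the project mentions.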

    References: