Project 1: Modelling Lexical Coherence

Supervisor: Simone Teufel

Special Resources: None

Description

Coherence in text -- the property of text to "glue together" in some sense -- is known to be based on lexical, syntactic, and discourse-based phenomena. This project tests different models of coherence for finding regions of coherent text in news articles and, if time permits, in scientific discourse. It will test three core methods against each other:

Lexical Coherence. Lexical and near-lexical repetition should be a strong indicator of coherence. However, this feature is known to respond most strongly to pronounced changes in topic, and is not expected to work very well for coherence as such. In the simplest instance, a TextTiling-style lexical overlap measure (Hearst, 1997) could be used. If time and implementation skills allow, lexical chains (Barzilay and Elhadad, 1997; Silber and McCoy, 2002) would provide a better model of lexical coherence.
Anaphora. Anaphoric links should by definition operate within coherent blocks of text. In the first instance, an out-of-the-box algorithm (e.g., LingPipe) running only on pronouns could be used. The student's main task here is to filter out pleonastic pronouns before this algorithm is run.
Entity-based Coherence. This algorithm models the progression of entities through the focus and attention span of the reader. In a coherent text, the way entities are introduced follows strong syntactic patterns. The algorithm for ranking discourses by Barzilay and Lapata (2005) requires a parser and an anaphora resolver. One of the challenges is to turn the ranking for comparable discourses into a scoring algorithm for non-comparable discourse snippets.
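
As an illustration of the simplest lexical option above, the following is a minimal sketch of a TextTiling-style overlap measure. It is a simplification and not Hearst's full algorithm: it compares blocks of whole sentences rather than fixed-width token sequences, it does no stemming or stop-word removal, and it picks the raw minimum instead of computing depth scores. The function names and the block_size parameter are assumptions made for this example.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase word tokens; no stemming or stop-word removal (a simplification)."""
    return re.findall(r"[a-z]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two term-frequency Counters."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def gap_scores(sentences, block_size=2):
    """scores[k] is the lexical similarity across the gap between
    sentence k and sentence k+1 (0-based), comparing up to block_size
    sentences on each side of the gap."""
    bags = [Counter(tokenize(s)) for s in sentences]
    scores = []
    for gap in range(1, len(bags)):
        left = sum(bags[max(0, gap - block_size):gap], Counter())
        right = sum(bags[gap:gap + block_size], Counter())
        scores.append(cosine(left, right))
    return scores


def lowest_gap(scores):
    """Index of the gap with the lowest similarity (a candidate break)."""
    return min(range(len(scores)), key=scores.__getitem__)
```

A low score at a gap suggests a possible break in lexical coherence at that point; how to turn such scores into regions of coherent text is left open here.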

A very similar project two years ago (Testuggine 2012) found the entity-based model to outperform lexical similarity on the task of finding related work sections in text. This project would build upon that work by moving to a "simpler", more clear-cut domain (newspaper text). Intellectually, the largest impact of this project is in devising a method for scoring, rather than ranking, coherence.
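
To make the entity-based representation concrete, here is a small sketch of the feature computation behind an entity grid: each entity has one grammatical role per sentence (S for subject, O for object, X for other, - for absent), and a text is described by the relative frequency of role transitions. The grid in the usage test is hand-built for illustration; in the project it would come from a parser and an anaphora resolver. The function name is an assumption, and the sketch stops at the features: it does not propose how to turn them into an absolute coherence score, which is the open question of the project.

```python
from collections import Counter
from itertools import product

ROLES = ("S", "O", "X", "-")  # subject, object, other, absent


def transition_distribution(grid, length=2):
    """grid maps each entity to its list of roles, one per sentence.
    Returns the relative frequency of every role sequence of the given
    length, counted down the columns of the grid."""
    counts = Counter()
    for roles in grid.values():
        for i in range(len(roles) - length + 1):
            counts[tuple(roles[i:i + length])] += 1
    total = sum(counts.values())
    return {
        seq: (counts[seq] / total if total else 0.0)
        for seq in product(ROLES, repeat=length)
    }
```

With four roles and transitions of length two, every text snippet is mapped to a vector of 16 values, regardless of its length or the number of entities it mentions.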

Evaluation: following previous work, a "cheap" gold standard of using layout information to infer breaks in coherence could be used. I would like to work with the student on a more convincing definition of truth, which should however still be determinable at least semi-automatically. That is, this project does not necessarily require human judgement as an evaluation.

This project would suit a student who has good intuition about writing style and is interested in algorithms (e.g., the lexical chain algorithm). It is an ambitious and work-heavy project of medium novelty (lots of implementation; programming language of choice). It is relatively low-risk, but if successful would be of interest to the general NLP community.

References:

Hearst, M. Text Tiling. Computational Linguistics. 1997.

Silber and McCoy, An algorithm for lexical chains, Computational Linguistics, 2002.

Barzilay and Lapata, Entity-based coherence, CL 2008.

Testuggine, Finding citation blocks by entity-based coherence, ACS thesis 2012.

Barzilay, R. and Elhadad, M. Using Lexical Chains for Text Summarization. 1997 Summarization Workshop.

Project 2: Discovering Plots in Narrative Texts

Proposer: Simone Teufel

Supervisor: Simone Teufel

Special Resources: None

Description

Plots in narrative texts are based on a) the main participants in the story, b) the main events in the story, and c) the time line. Textual coherence is of importance in the analysis of plots as well. The aim of this project is to build a model of the three main plot components in short stories using named entity recognition, a parser, and a model of lexical coherence. Once a), b) and/or c) are detected, it is possible to generate different kinds of summaries, depending on whether we want to summarise the story for somebody who is yet to read the text (in which case we do not want to give away the plot, but give a feeling for the setting of the text), or somebody who does not intend to read the text but wants a summary for other purposes (in which case we want to summarise the main points).

I propose to use older short stories of foreign origin, where the copyright has lapsed, and where there is often more than one translation. This will allow us to test our models of lexical cohesion. Knowing that both translations contain basically the same information will help pinpoint problems with the detection of coherence.

Another advantage of using well-known older novels is that multiple human-written summaries (both "blurbs" and CliffsNotes-type informative summaries) exist for these texts, which will allow the student to use the ROUGE methodology as a cheap evaluation. More fine-grained and informative evaluation would include the elicitation of human judgements, but is optional in this project.
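
For orientation, the core of the ROUGE methodology is n-gram recall of a system summary against one or more human-written reference summaries. The following is a minimal sketch of ROUGE-N recall only; it is not the official ROUGE package, it uses a naive tokeniser, and the function names are assumptions made for this example.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase word tokens (a naive tokeniser, for illustration only)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def ngrams(tokens, n):
    """Multiset of the n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n_recall(candidate, references, n=2):
    """Clipped n-gram matches between the candidate and the references,
    divided by the total number of n-grams in the references."""
    cand = ngrams(tokenize(candidate), n)
    matched = 0
    total = 0
    for ref in references:
        ref_counts = ngrams(tokenize(ref), n)
        matched += sum(min(count, cand[gram]) for gram, count in ref_counts.items())
        total += sum(ref_counts.values())
    return matched / total if total else 0.0
```

Because several human-written summaries exist for well-known texts, the references argument takes a list, so a system summary can be scored against all of them at once.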

This highly innovative project is a feasibility study of whether current advances in lexical semantics and parsing are enough to automate the detection of plot in longer texts, and possibly to use this information for summarisation. Depending on whether the student wishes to concentrate on a), b) or c), the task can be attacked in several ways:

A standard coreference system can be used for a), but more research can be undertaken in specialised models, which are likely to require some annotation on the part of the student.
To detect events (part b), verb semantic clustering can be applied. Parser output can be analysed to give tense information.
Grammars for time expressions exist and could be used in a black-box manner. However, recognising and representing a temporal framework for a short story remains a research challenge.

This project is very ambitious, and could lead to a better understanding of the discourse structure of narrative texts. However, the existence of a simple model based on the minimal implementation of a)-c) as a fall-back makes this project less risky than it might initially look: even the successful implementation of the simple model would be extremely novel, and thus interesting to test against the state of the art.

Due to the speculative nature of the project, the student choosing this project should be a competent programmer and problem solver who can work on a task independently. Some annotation/data preparation may have to be performed.

References:

Lehnert (1981): Plot units and narrative summarization. Cognitive Science.

Mani, Pustejovsky (2004): Temporal discourse models for narrative structure. In: Proceedings of the 2004 ACL Workshop on Discourse Annotation

Jaynes, Golden (2003): Statistical Detection of Local Coherence Relations in Narrative Recall and Summarization Data. In: Proceedings of the 25th Annual Conference of the Cognitive Science Society.

Nakhimovsky (1988): Aspect, aspectual class, and the temporal structure of narrative. Computational Linguistics, 1988.

Kazantseva and Szpakowicz (2010): Summarizing Short Stories. Computational Linguistics 36(1): 71-109

Lin and Hovy (2003). Automatic Evaluation of Summaries Using N-gram Co-occurrence Statistics. In: Proceedings of 2003 Language Technology Conference (HLT-NAACL 2003). ROUGE SW

Boguraev and Kennedy (1999). Salience-based content characterisation of text documents. In: Advances in Automatic Text Summarization, pp. 2--9.

Gutenberg; short stories

Project 3: Cross-Discipline Determination of Rhetorically Charged Sentences in Scientific Writing

Proposer: Simone Teufel

Supervisor: Simone Teufel

Special Resources: None

Description

Detecting innovation in science across an entire field, by analysing the scientific literature automatically, is currently a hot research area. While most approaches rely on a statistical analysis of the words contained in the papers, this project uses parsing and machine learning to detect rhetorically charged sentences, and therefore uncovers innovations which are explicitly declared in the paper. In particular, this project concentrates on the detection of:

Statements of innovation ("to our knowledge, we are the first to...")
Naming statements of own artefacts ("a process we name MILRED")
Criticisms of other researchers ("Miller and Berger, 1998, fail to recognise the importance of ...")

(Possibly a subset of these). Starting from a set of annotated sentences of these kinds in one domain (computational linguistics), and known indicator phrases for these sentences, parsing and WordNet are to be used to detect similar statements, initially in unseen text of the same domain, then in text of different domains (chemistry, biochemistry, engineering, computer science). Variations of the meta-discourse phrases will be found by generalising over parser output ("we are the first to present" = "we present X" = "X is presented"). Evaluation might rely on a gold standard (created by the student themselves), or on human evaluation of the system output. The student choosing this project should have an interest in pragmatics and semantics and should not shy away from data analysis. Programming language of choice. This is a relatively "safe" project, where a student could achieve a basic system within a few weeks and then steadily improve the system.
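
A basic system of the kind mentioned above could start from surface matching of indicator phrases, before any parsing or WordNet generalisation is added. The sketch below is such a baseline only: the patterns are hypothetical examples written for this illustration (they are not the known indicator phrases referred to in the description), and plain regular expressions stand in for the parser-based generalisation the project actually calls for.

```python
import re

# Hypothetical indicator patterns, a handful per category (illustration only).
PATTERNS = {
    "innovation": [
        r"\bto (?:our|my) knowledge\b",
        r"\b(?:we|i) are the first to\b",
    ],
    "naming": [
        r"\b(?:we|i) (?:name|call|dub|term)\b",
        r"\bwhich we refer to as\b",
    ],
    "criticism": [
        r"\bfails? to\b",
        r"\b(?:does|do) not (?:account for|address|consider)\b",
    ],
}

COMPILED = {
    label: [re.compile(p, re.IGNORECASE) for p in patterns]
    for label, patterns in PATTERNS.items()
}


def classify(sentence):
    """Return the sorted list of categories whose patterns match the sentence."""
    return sorted(
        label
        for label, patterns in COMPILED.items()
        if any(p.search(sentence) for p in patterns)
    )
```

Surface patterns like these miss variations such as passives ("X is presented") and paraphrases, which is where generalising over parser output would take over.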

References:

Teufel (2010). The Structure of Scientific Articles. CSLI Publications. Chapter 6.3. (In Library).
Teufel, Siddharthan and Tidhar (2006). Automatic classification of Citation Function. In: Proceedings of EMNLP.
Lisacek, F., Chichester, C., Kaplan, A., Sandor, A. (2005). Discovering paradigm shift sentences in biomedical abstracts. In: International Symposium on Semantic Mining in Biomedicine (SMBM).

Project 4: Lexical Chains Plus

Proposer: Simone Teufel

Supervisor: Simone Teufel

Special Resources: None

Description

This project compares "normal" lexical chains to a new form of lexical chains which incorporates anaphora resolution. Lexical chains are a construct where related concepts are clustered into chains, each of which represents a topic. Each concept is expressed as a disambiguated sense and can only occur on one lexical chain. Algorithms for lexical chain creation exist; the best ones are linear in time (Silber and McCoy, 2002). This project requires a student to implement this algorithm, integrate it with an anaphora resolver of choice (either their own, or an out-of-the-box one), and use one of the many applications of lexical chains to evaluate whether this improves results. How the anaphora resolver's results are incorporated is up to the creativity of the student. A simple reweighting scheme for referred-to entities in each lexical chain could be tried first. Known applications for lexical chains are manifold:

summarisation
detection of malapropisms
navigational tools
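
The simple reweighting scheme mentioned in the description could, for instance, add a bonus to a chain for every pronoun that the anaphora resolver links to one of the chain's members. The sketch below shows one such scheme under loud assumptions: the baseline chain score is just the number of mentions on the chain, the bonus value is arbitrary, and the input formats (chains as lists of concept mentions, resolver output as counts per antecedent) are hypothetical. It is not part of the Silber and McCoy algorithm.

```python
def chain_scores(chains, antecedent_counts, bonus=0.5):
    """Score each lexical chain.

    chains: dict mapping a chain id to its list of concept mentions,
        one entry per occurrence in the text.
    antecedent_counts: dict mapping a concept to the number of pronouns
        the anaphora resolver resolved to it (hypothetical format).
    bonus: weight of one resolved pronoun relative to one explicit
        mention (an arbitrary value chosen for this illustration).
    """
    scores = {}
    for chain_id, mentions in chains.items():
        concepts = set(mentions)
        pronoun_refs = sum(antecedent_counts.get(c, 0) for c in concepts)
        scores[chain_id] = len(mentions) + bonus * pronoun_refs
    return scores


def strongest_chain(scores):
    """Id of the highest-scoring chain."""
    return max(scores, key=scores.get)
```

Setting bonus to 0 recovers the chains without anaphora information, which gives a direct way to compare the two variants in any of the applications listed above.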

This is a rather straightforward project, which can be extended to suit the student's skills and interests. In its most basic form, it can be performed with a standard summarisation evaluation of an extractive summariser, the ROUGE package, and an out-of-the-box anaphora resolver. A more ambitious version of this project would test the quality of the lexical chains intrinsically, possibly use a different application of lexical chains, and experiment with the student's own version of a known anaphora resolution algorithm.

References:

Silber and McCoy (2002) An algorithm for lexical chains, Computational Linguistics.
Kennedy and Boguraev (1996). Anaphora for everyone: Pronominal anaphora resolution without a parser. In: Proceedings of COLING 1996.
Barzilay, R. and Elhadad, M. Using Lexical Chains for Text Summarization. 1997 Summarization Workshop.