Coherence in text -- the property of a text to "glue together" in some sense -- is known to be based on lexical, syntactic, and discourse-based phenomena. This project tests different models of coherence in order to find regions of coherent text in news articles and, if time permits, in scientific discourse. It will test three core methods against each other:
Evaluation: following previous work, a "cheap" gold standard that uses layout information to infer breaks in coherence could be used. I would like to work with the student on a more convincing definition of ground truth, which should, however, still be determinable at least semi-automatically. That is, this project does not necessarily require human judgement for evaluation.
This project would suit a student who has good intuition about writing style and is interested in algorithms (e.g., the lexical chain algorithm). It is an ambitious and work-heavy project of medium novelty (lots of implementation; programming language of choice). It is relatively low-risk, but if successful would be of interest to the general NLP community.
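To give a flavour of the algorithms involved, here is a minimal sketch of a TextTiling-style coherence profile (after Hearst): it measures lexical overlap, as cosine similarity of word counts, between the blocks of sentences on either side of each sentence gap, so that low-similarity valleys suggest coherence breaks. It is an illustration only, under simplifying assumptions (no stemming, stop-word handling, or smoothing), not a prescription for the project.

    import math
    import re
    from collections import Counter

    def bag_of_words(sentences):
        """Lower-cased word counts for a list of sentences."""
        words = []
        for s in sentences:
            words.extend(re.findall(r"[a-z']+", s.lower()))
        return Counter(words)

    def cosine(c1, c2):
        """Cosine similarity between two Counter word vectors."""
        dot = sum(c1[w] * c2[w] for w in c1 if w in c2)
        norm = math.sqrt(sum(v * v for v in c1.values())) * \
               math.sqrt(sum(v * v for v in c2.values()))
        return dot / norm if norm else 0.0

    def coherence_profile(sentences, block_size=3):
        """Similarity between the sentence blocks to the left and right of each
        gap; low values are candidate coherence breaks."""
        scores = []
        for gap in range(1, len(sentences)):
            left = bag_of_words(sentences[max(0, gap - block_size):gap])
            right = bag_of_words(sentences[gap:gap + block_size])
            scores.append(cosine(left, right))
        return scores

Lexical chains or the entity grid would replace this bag-of-words overlap with deeper lexical or entity-based evidence, but the boundary-finding and evaluation set-up stays much the same.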
Barzilay, R. and Elhadad, M. (1997). Using Lexical Chains for Text Summarization. Summarization Workshop.
Hearst, M. (1997). TextTiling. Computational Linguistics.
Silber and McCoy (2002). An algorithm for lexical chains. Computational Linguistics.
Barzilay and Lapata (2008). Entity-based coherence. Computational Linguistics.
Testuggine (2012). Finding citation blocks by entity-based coherence. ACS thesis.
CBBC Newsround (www.bbc.co.uk) is a news site for children. It contains specially selected, very short news items whose text has been simplified for children. The news items are "real" in the sense that they correspond to current (real-time) news items on the "normal" BBC News site (or other sites).
The texts on CBBC Newsround are written by journalists. They are short, so some kind of summarisation is taking place. The sentences themselves are also shorter than those in texts aimed at adults, and lexical items have been paraphrased to lower the reading age of the texts (syntactic and lexical simplification). The idea of this project is to simulate one or several of the tasks by which an automatic process could generate such stories, applying standard summarisation methods, and to evaluate the generated texts against the human gold standards.
The summarisation algorithm would run on pairs of texts previously harvested from BBC News and CBBC Newsround. Whether or not this harvesting step is automated is up to the student. Depending on the method chosen below, the parallel corpus could be used for training purposes for step one below, and possibly for other steps as well.
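As a concrete starting point for the evaluation step, the sketch below computes ROUGE-1-style unigram recall and precision of a system-generated children's version against the corresponding human-written CBBC Newsround article. The pairing loop and the simplify_and_summarise call are placeholders for whatever corpus format and pipeline the student ends up building; the sketch only illustrates the comparison itself.

    import re
    from collections import Counter

    def unigrams(text):
        """Lower-cased unigram counts."""
        return Counter(re.findall(r"[a-z']+", text.lower()))

    def rouge1(generated, gold):
        """Return (recall, precision) of the generated unigrams against the gold text."""
        gen, ref = unigrams(generated), unigrams(gold)
        overlap = sum(min(gen[w], ref[w]) for w in ref if w in gen)
        recall = overlap / sum(ref.values()) if ref else 0.0
        precision = overlap / sum(gen.values()) if gen else 0.0
        return recall, precision

    # Hypothetical pairing of harvested (BBC News, CBBC Newsround) articles;
    # simplify_and_summarise stands in for the student's generation pipeline.
    # for adult_text, child_gold in corpus_pairs:
    #     print(rouge1(simplify_and_summarise(adult_text), child_gold))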
This project is really three projects in one, so the student will have to choose the direction according to their interests. An ideal student for this project would have ideas of their own, and be rather goal-driven, as several subtasks need to be solved and individually evaluated. The risk is higher than in the project above, but the novelty factor (particularly for the lexical substitution task) is also higher.
Kupiec et al. (1995). A trainable document summarizer. SIGIR '95.
Teufel and Moens (1997). Sentence Extraction as a Classification Task. ACL Summarisation Workshop.
Zajic, Dorr, Lin and Schwartz (2007). Multi-candidate reduction: Sentence compression as a tool for document summarization tasks. Information Processing & Management.
Siddharthan (2003). Preserving Discourse Structure when Simplifying Text. European Chapter of the Association for Computational Linguistics (EACL).
Yatskar, Pang, Danescu-Niculescu-Mizil and Lee (2010). For the sake of simplicity: Unsupervised extraction of lexical simplifications from Wikipedia. NAACL.
Project 3: A Summariser based on a model of human memory limitations

Description

Kintsch and van Dijk (1978), two psychologists, proposed a summarisation algorithm based on assumptions about the limitations of human memory. It simulates what the human brain would presumably do if a text were processed incrementally but only X propositions about the text could be kept in memory at any one time. As each new sentence is read, and depending on the state of the current buffer, new propositions can be collapsed with existing ones, replace existing ones, or be thrown away. The factors that decide which of these operations is applied rely on argument overlap, but also on semantic entailment and generalisation. Once the text has been processed, a set of propositions remains in memory; these constitute the summary and are then translated into fluent text.
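As a toy rendering of the buffer mechanism just described, the sketch below assumes propositions have already been extracted as (predicate, argument-set) tuples and uses only argument overlap -- no entailment, generalisation, or macro-rules -- to decide which propositions survive each processing cycle; the propositions retained longest are returned as the propositional summary. This is not a faithful implementation of the 1978 model, merely an illustration of the control structure.

    def overlap(prop, new_props):
        """Number of arguments a proposition shares with the incoming propositions."""
        new_args = set().union(*(args for _, args in new_props)) if new_props else set()
        return len(prop[1] & new_args)

    def summarise(sentences_as_props, capacity=4):
        """Toy Kintsch & van Dijk buffer: keep at most `capacity` propositions per
        cycle, preferring those best connected (by argument overlap) to the new sentence."""
        buffer, retention = [], {}
        for new_props in sentences_as_props:      # one list of propositions per sentence
            candidates = buffer + new_props
            candidates.sort(key=lambda p: overlap(p, new_props), reverse=True)
            buffer = list(dict.fromkeys(candidates))[:capacity]   # dedupe, keep best-connected
            for p in buffer:                      # count how long each proposition survives
                retention[p] = retention.get(p, 0) + 1
        # the most-retained propositions constitute the propositional summary
        return sorted(retention, key=retention.get, reverse=True)[:capacity]

    # Hand-crafted example; propositions are (predicate, frozenset of arguments):
    doc = [
        [("announce", frozenset({"government", "rules"}))],
        [("affect", frozenset({"rules", "schools"})),
         ("welcome", frozenset({"schools", "rules"}))],
        [("criticise", frozenset({"parents", "rules"}))],
    ]
    print(summarise(doc, capacity=2))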
The model was originally demonstrated by hand simulation and was well received in the 1980s, but it has been all but forgotten since then because it was thought, at the time, to be unimplementable. This is for two main reasons:
This highly innovative project is a feasibility study of whether current advances in lexical semantics and parsing are sufficient to automate this attractive, explanatory summarisation model.
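On the parsing side, the sketch below shows one crude way a modern dependency parser could supply the propositions such a model needs; spaCy and its en_core_web_sm model are assumed purely for illustration, and each verb together with its subject/object dependents is treated as one proposition. Real proposition extraction would need to be considerably more careful than this.

    import spacy

    nlp = spacy.load("en_core_web_sm")   # assumes the small English model is installed

    def crude_propositions(text):
        """Return (verb_lemma, frozenset of argument lemmas) tuples, one per verb."""
        props = []
        for token in nlp(text):
            if token.pos_ == "VERB":
                args = {child.lemma_.lower() for child in token.children
                        if child.dep_ in ("nsubj", "nsubjpass", "dobj", "iobj")}
                if args:
                    props.append((token.lemma_.lower(), frozenset(args)))
        return props

    print(crude_propositions("The government announced new rules and schools welcomed the decision."))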
Based on the problems mentioned above, several models can be built:
Evaluation of system output can then proceed in two ways:
The project is very ambitious and innovative, and could lead to an entirely new type of summariser that is more explanatory than those currently in use. However, the existence of the simple model as a fall-back makes this project less risky than it might initially look: even a successful implementation of the simple model alone would be extremely novel, and thus interesting to test against the state of the art.
Due to the speculative nature of the project, the student choosing it should be a competent programmer and problem solver who can work independently. Some annotation/data preparation will also have to be performed.
References:

Kintsch and van Dijk (1978). Toward a Model of Text Comprehension and Production. Psychological Review, 85(5). Hardcopy available from me upon request (in case you can't get to the library).
Bos and Markert (2005). Recognising textual entailment with logical inference. Proceedings of HLT '05.