Coherence in text -- the property of text to "glue together" in some sense -- is known to be based on lexical, syntactic, and discourse-based phenomena. This project tests different models of coherence to find regions of coherent text in news articles and, if time permits, in scientific discourse. It will test three core methods against each other:
Evaluation: following previous work, a "cheap" gold standard could be used, in which layout information is taken to indicate breaks in coherence. I would like to work with the student on a more convincing definition of ground truth, which should nevertheless still be determinable at least semi-automatically; i.e., this project does not necessarily require human judgement for evaluation.
This project would suit a student who has good intuition about writing style and is interested in algorithms (e.g., the lexical chain algorithm). It is an ambitious and work-heavy project of medium novelty (lots of implementation; programming language of choice). It is relatively low-risk, but if successful would be of interest to the general NLP community.
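One of the candidate methods, lexical-repetition scoring in the style of Hearst's TextTiling (referenced below), could be sketched as follows. This is a minimal illustration, not the full algorithm: it only computes a cohesion score at each inter-sentence gap, where low scores suggest candidate coherence breaks.

```python
from collections import Counter
import math

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words vectors."""
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def gap_scores(sentences, block_size=2):
    """Lexical-cohesion score at each gap between sentences:
    similarity of the block of sentences before the gap vs. the
    block after it. Low scores suggest a coherence break."""
    bags = [Counter(s.lower().split()) for s in sentences]
    scores = []
    for gap in range(1, len(bags)):
        left = sum(bags[max(0, gap - block_size):gap], Counter())
        right = sum(bags[gap:gap + block_size], Counter())
        scores.append(cosine(left, right))
    return scores
```

The real TextTiling algorithm additionally uses fixed-size token sequences rather than sentences, a stop-list, smoothing of the gap scores, and boundary placement at sufficiently deep local minima.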
Barzilay, R. and Elhadad, M. Using Lexical Chains for Text Summarization. 1997 Summarization Workshop.
Hearst, M. Text Tiling. Computational Linguistics. 1997.
Silber and McCoy, An algorithm for lexical chains, Computational Linguistics, 2002.
Barzilay and Lapata, Entity-based coherence, CL 2008.
Testuggine, Finding citation blocks by entity-based coherence, ACS thesis 2012.
CBBC Newsround (www.bbc.co.uk) is a news site for children. It contains specially selected, very short news items whose text has been simplified for children. The news items are "real" in the sense that they correspond to current (real-time) news items on the "normal" BBC News site (or other sites).
The texts on CBBC Newsround are written by journalists. They are short, so some kind of summarisation is taking place. The sentences themselves are also shorter than those in texts addressed to adults, and lexical items have been paraphrased to lower the reading age of the texts (syntactic and lexical simplification). The idea of this project is to simulate one or several of the tasks by which an automatic process could generate such stories, and to apply standard summarisation methods to evaluate the generated texts against the human gold standards.
The summarisation algorithm would run on pairs of texts previously harvested from BBC News and CBBC Newsround. Whether or not this harvesting step is automated is up to the student. Depending on the method chosen below, the parallel corpus could be used for training purposes for step one below, and possibly for other steps as well.
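The extractive step could start from a Kupiec-style trainable sentence scorer. The sketch below uses two of the classic features (sentence position and keyword overlap); the weights are made up for illustration, whereas Kupiec et al. learn feature weights from an aligned corpus with a naive Bayes classifier.

```python
from collections import Counter

def _toks(sentence):
    """Crude tokeniser: lowercase, strip surrounding punctuation."""
    return [w.strip(".,;:!?").lower() for w in sentence.split()]

def extract_summary(sentences, n=2):
    """Score each sentence by (a) position (earlier is better, a
    strong cue in news) and (b) overlap with the document's most
    frequent content words; return the top-n sentences in document
    order. The 0.5/0.5 weights are illustrative, not trained."""
    stop = {"the", "a", "an", "of", "to", "in", "and", "is", "are",
            "on", "was", "from", "will"}
    doc_counts = Counter(w for s in sentences for w in _toks(s)
                         if w not in stop)
    keywords = {w for w, _ in doc_counts.most_common(10)}

    def score(i, sentence):
        words = [w for w in _toks(sentence) if w not in stop]
        overlap = sum(w in keywords for w in words) / (len(words) or 1)
        position = 1.0 / (i + 1)  # lead bias
        return 0.5 * overlap + 0.5 * position

    ranked = sorted(range(len(sentences)),
                    key=lambda i: score(i, sentences[i]), reverse=True)
    return [sentences[i] for i in sorted(ranked[:n])]
```

With the parallel corpus, the made-up weights would be replaced by parameters estimated from which BBC News sentences survive into the Newsround version.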
This project is really three projects in one, so the student will have to choose the direction according to their interests. An ideal student for this project would have ideas of their own, and be rather goal-driven, as several subtasks need to be solved and individually evaluated. The risk is higher than in the project above, but the novelty factor (particularly for the lexical substitution task) is also higher.
Kupiec et al. (1995). A Trainable Document Summarizer. SIGIR 95.
Teufel and Moens (1997). Sentence Extraction as a Classification Task. ACL Summarisation Workshop 1997.
Zajic, Dorr, Lin and Schwartz (2007). Multi-candidate Reduction: Sentence Compression as a Tool for Document Summarization Tasks. Information Processing & Management.
Siddharthan (2003). Preserving Discourse Structure when Simplifying Text. European Chapter of the Association for Computational Linguistics (EACL).
Yatskar, Pang, Danescu-Niculescu-Mizil and Lee (2010). For the Sake of Simplicity: Unsupervised Extraction of Lexical Simplifications from Wikipedia. NAACL 2010.
Kintsch and van Dijk (1978), two psychologists, proposed a summarisation algorithm based on assumptions about human memory limitations. It simulates what the human brain would presumably do if text is processed incrementally but only X propositions about a text can be kept in memory. When each new sentence is read, depending on the state of the previous buffer, new propositions can be collapsed with existing ones, replace existing ones, or be thrown away. The factors that decide which of these operations is applied rely on argument overlap, but also on semantic entailment and generalisation. Once the text has been processed, a set of propositions remains in memory. These constitute the summary and are then translated into fluent text.
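Restricted to the "simple model" (argument overlap only, no entailment or generalisation), one processing cycle of the buffer could be sketched as follows. The representation of propositions as (predicate, argument-tuple) pairs and the retention rule are assumptions for illustration, not Kintsch and van Dijk's exact formulation.

```python
def kvd_cycle(memory, new_props, capacity=4):
    """One cycle of a (much simplified) Kintsch & van Dijk buffer:
    add the propositions of the new sentence, then keep only
    `capacity` propositions, preferring those whose arguments
    overlap with others (they carry the coherence of the text so
    far). Propositions are (predicate, args) pairs."""
    pool = memory + [p for p in new_props if p not in memory]

    def connectivity(prop):
        _, args = prop
        return sum(
            len(set(args) & set(other_args))
            for other_pred, other_args in pool
            if (other_pred, other_args) != prop
        )

    # Stable sort: on ties, older propositions are retained first.
    ranked = sorted(pool, key=connectivity, reverse=True)
    return ranked[:capacity]
```

Running this over a text sentence by sentence, the propositions that survive many cycles would form the raw material of the summary.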
The model was originally demonstrated by hand-simulation and was well received in the 1980s, but it has been all but forgotten since then because it was thought (at the time) to be unimplementable. This is for two main reasons:
This highly innovative project is a feasibility study of whether current advances in lexical semantics and parsing are enough to automate this attractive, explanatory summarization model.
Based on the problems mentioned above, several models can be built:
Evaluation of system output can then proceed in two ways:
The project is very ambitious and innovative, and could lead to an entirely new type of summarizer, one that is more explanatory than the summarizers currently in use. However, the existence of the simple model as a fall-back makes this project less risky than it might initially look: even a successful implementation of the simple model would be extremely novel, and thus interesting to test against the state of the art.
Due to the speculative nature of the project, the student choosing this project should be a competent programmer and problem solver who can work on a task independently. Some annotation/data preparation will have to be performed.
Kintsch and van Dijk (1978). Toward a Model of Text Comprehension and Production. Psychological Review, vol. 85, number 5. Hardcopy available from me upon request (in case you can't get to the library).
Bos and Markert (2005). Recognising Textual Entailment with Logical Inference. Proceedings of HLT '05.
This project tests different models of coherence to find regions of coherent text in scientific discourse, in particular two types of coherent regions:
This project would suit a student who is interested in algorithms
(e.g., the lexical chain algorithm), and who likes data work (e.g.,
looking through dozens of citation blocks, deciding where they start
and end). The student should have good intuition about writing style
in science, and be able to generalise over similarities in writing
style. Programming language of choice.
Barzilay, R. and Elhadad, M. Using Lexical Chains for Text Summarization. 1997 Summarization Workshop.
Hearst, M. Text Tiling. Computational Linguistics. 1997.
Silber and McCoy, An algorithm for lexical chains, Computational Linguistics, 2002.
Barzilay and Lapata, Entity-based coherence, ACL 2005.
This project concerns anaphora resolution in scientific text, with a particular focus on pronouns and demonstrative definite NPs. The project consists of the following stages:
This project will probably require some manual annotation from the student. Definite NP reference is particularly difficult in scientific discourse and will not be addressed unless particularly fast progress is achieved. A comparison to a baseline algorithm such as LingPipe coreference (which is not specialised to scientific literature) should be performed.
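An even simpler baseline, below LingPipe, is recency plus number agreement: resolve each pronoun to the nearest preceding NP with matching number. The input format and all names in this sketch are assumptions for illustration, not any library's API.

```python
def resolve(mentions):
    """mentions: list of (surface, kind, number) in document order,
    with kind in {"np", "pron"} and number in {"sg", "pl"}.
    Resolves each pronoun to the nearest preceding NP that agrees
    in number. Real resolvers add syntax, salience and, for
    scientific text, special handling of citations."""
    antecedents = {}
    seen = []  # preceding NPs, most recent last
    for i, (surface, kind, number) in enumerate(mentions):
        if kind == "pron":
            for j in range(len(seen) - 1, -1, -1):
                s, n = seen[j]
                if n == number:
                    antecedents[i] = s
                    break
        else:
            seen.append((surface, number))
    return antecedents
```

Even this baseline already resolves many pronouns in expository text correctly, which is why the comparison against it is informative.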
An ideal student for this project would be rather goal-driven, as several subtasks need to be solved and individually evaluated, and would be linguistically interested. Programming language of choice.
Kim and Webber (2006). Automatic reference resolution in astronomy articles. International CODATA Conference, Beijing.
Ge et al. (1998). A Statistical Approach to Anaphora Resolution. Proceedings of the Sixth Workshop on Very Large Corpora, COLING-ACL '98, Montreal, Canada.
Kaplan and Tokunaga (2009). A citation-based approach to summarisation. NLPIR4DL, ACL workshop, Singapore.
Sentence-based abstract-document alignment has great benefits for summarisation, as sentences in the abstract are often picked from the rest of the document. How many differences there are between such sentence pairs depends on individual writing style, but overall few changes are observed. The goal of this project is to explore different methods for performing this alignment. A small gold standard corpus of 80 aligned abstract--document pairs exists, which could be expanded with very little work from the student if necessary.
In general, it is hard to find good semantic similarity metrics for linguistic objects as short as sentences. Both the longest common substring algorithm (Teufel and Moens 1997) and vector space comparison (e.g., as in Marcu, 1999) are known to perform rather weakly; such simple algorithms can, however, be used as a baseline in this project.
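Such a baseline could be sketched as follows. This version uses the longest common subsequence of words, normalised by the shorter sentence; both choices are assumptions for illustration (Teufel and Moens use a related longest common substring measure).

```python
def lcs_similarity(s1: str, s2: str) -> float:
    """Longest common subsequence of words, normalised by the
    length of the shorter sentence; 1.0 means the shorter
    sentence's words all appear, in order, in the other."""
    a, b = s1.lower().split(), s2.lower().split()
    if not a or not b:
        return 0.0
    # Standard dynamic programme: dp[i][j] = LCS length of a[:i], b[:j].
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, wa in enumerate(a, 1):
        for j, wb in enumerate(b, 1):
            dp[i][j] = dp[i-1][j-1] + 1 if wa == wb else max(dp[i-1][j], dp[i][j-1])
    return dp[len(a)][len(b)] / min(len(a), len(b))
```

A weakness this makes obvious: the measure sees no similarity at all between paraphrases with no word overlap, which is exactly what the semantic-space methods in this project are meant to fix.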
There are two flavours of this project. The first is an (as close to exhaustive as possible) exploration of different variables in semantic spaces for this task: syntactic vs. keyword-based, and dimensionality reduction methods such as LSI. The second option is to implement only ONE promising semantic space, and to combine this measure of semantic similarity with two additional facts about abstract--document alignment:
The student best suited for this project should be a competent programmer and very systematic (for option 1), or mathematically interested with an interest in developing new algorithms (for option 2).
Teufel, S. and Moens, M. (1997). A gold standard for abstract-document alignment. ACL-workshop on Automatic Summarization.
Marcu (1999). The automatic construction of large-scale corpora for summarization research. SIGIR-99.
Project 3: A Summariser based on a model of human memory limitations
Description
References:
2011 Project: Coherence in Scientific Discourse
Description
This project is to explore different coherence-based approaches to finding the two types of blocks. Options are coherence by lexical chains (Silber and McCoy 2002), coherence by lexical repetition (Hearst 1997) and entity-based coherence (Barzilay and Lapata 2005). These approaches have been successful in news texts. The project involves implementing at least two of the mentioned approaches for at least one of the types of blocks.
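For the entity-based option, the core data structure is the entity grid of Barzilay and Lapata: entities in rows, sentences in columns, with each cell recording the entity's syntactic role. Assuming pre-extracted (entity, role) information per sentence rather than full parsing, the grid and its role-transition distribution might be built as:

```python
from collections import Counter
from itertools import product

def entity_grid(sentences):
    """sentences: list of {entity: role} dicts, role in {"S","O","X"}.
    Returns the grid as {entity: [role or "-" per sentence]}."""
    entities = {e for sent in sentences for e in sent}
    return {e: [sent.get(e, "-") for sent in sentences] for e in entities}

def transition_probs(grid):
    """Distribution over role bigrams (S->S, S->-, ...) read down
    each entity's column; in Barzilay and Lapata's model this
    vector is the coherence representation of the text."""
    counts = Counter()
    for roles in grid.values():
        for a, b in zip(roles, roles[1:]):
            counts[(a, b)] += 1
    total = sum(counts.values()) or 1
    return {t: counts[t] / total for t in product("SOX-", repeat=2)}
```

Coherent regions tend to show dense S/O transitions for a few entities, while a block boundary shows up as columns full of "-".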
Citation blocks might be a promising starting point, as the algorithms can be run on an existing,
citation-parsed corpus of around 16,000 scientific texts in one area. An evaluation method of choice is to be used to determine how well the
algorithms perform relative to each other, and to a
baseline. Evaluation possibilities are a) a gold-standard evaluation,
which means that the student performs some annotation [some annotated material for citation blocks exists] or b) a human
evaluation study, where human subjects are asked if they agree with
the system's boundaries.
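For the gold-standard option, boundary agreement is usually scored with a window-based metric rather than exact boundary match, so that near-misses are penalised less than complete misses. A sketch of the WindowDiff metric of Pevzner and Hearst, assuming segmentations are encoded as 0/1 boundary vectors (one value per gap between sentences):

```python
def window_diff(reference, hypothesis, k=None):
    """reference, hypothesis: sequences of 0/1, one per gap
    between sentences (1 = boundary). Slides a window of size k
    and counts positions where the two disagree on the number of
    boundaries inside the window; 0.0 is perfect, 1.0 is worst."""
    assert len(reference) == len(hypothesis)
    if k is None:
        # Conventional choice: half the mean reference segment length.
        n_segments = sum(reference) + 1
        k = max(1, round((len(reference) + 1) / (2 * n_segments)))
    errors = 0
    spans = len(reference) - k + 1
    for i in range(spans):
        if sum(reference[i:i+k]) != sum(hypothesis[i:i+k]):
            errors += 1
    return errors / spans if spans > 0 else 0.0
```

This would let the two implemented coherence algorithms be compared on the same footing, against each other and against a degenerate baseline (e.g., no boundaries, or boundaries at fixed intervals).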
References:
2011 Project: An Anaphora resolver for Scientific Discourse
Description
A partial solution to the scientific anaphora resolution problem has been presented in the context of reference to citations, by Kim and Webber (2006), who treat only "they" with supervised ML, and by Kaplan and Tokunaga (2009), who use out-of-the-box anaphora resolution.
References:
2011 Project: A Comparison of Semantic Spaces for Abstract-Document Alignment
Description
Option 2 turns the best choice of alignment into a constraint satisfaction problem.
References: