Coherence in text -- the property of a text to "glue together" in some sense -- is known to be based on lexical, syntactic, and discourse-based phenomena. This project tests different models of coherence in order to find regions of coherent text in news articles and, if time permits, in scientific discourse. It will test three core methods against each other:
Evaluation: following previous work, a "cheap" gold standard that uses layout information to infer breaks in coherence could be used. I would like to work with the student on a more convincing definition of ground truth, which should, however, still be determinable at least semi-automatically. That is, this project does not necessarily require human judgement for evaluation.
This project would suit a student who has good intuition about writing style and is interested in algorithms (e.g., the lexical chain algorithm). It is an ambitious and work-heavy project of medium novelty (lots of implementation; programming language of choice). It is relatively low-risk, but if successful would be of interest to the general NLP community.
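To give a flavour of the algorithms involved, here is a minimal sketch of a TextTiling-style coherence profile (after Hearst): it measures lexical overlap, as cosine similarity of word counts, between the blocks of sentences on either side of each sentence gap, so that low-similarity valleys suggest coherence breaks. It is an illustration only, under simplifying assumptions (no stemming, stop-word handling, or smoothing), not a prescription for the project.

    import math
    import re
    from collections import Counter

    def bag_of_words(sentences):
        """Lower-cased word counts for a list of sentences."""
        words = []
        for s in sentences:
            words.extend(re.findall(r"[a-z']+", s.lower()))
        return Counter(words)

    def cosine(c1, c2):
        """Cosine similarity between two Counter word vectors."""
        dot = sum(c1[w] * c2[w] for w in c1 if w in c2)
        norm = math.sqrt(sum(v * v for v in c1.values())) * \
               math.sqrt(sum(v * v for v in c2.values()))
        return dot / norm if norm else 0.0

    def coherence_profile(sentences, block_size=3):
        """Similarity between the sentence blocks to the left and right of each
        gap; low values are candidate coherence breaks."""
        scores = []
        for gap in range(1, len(sentences)):
            left = bag_of_words(sentences[max(0, gap - block_size):gap])
            right = bag_of_words(sentences[gap:gap + block_size])
            scores.append(cosine(left, right))
        return scores

Lexical chains or the entity grid would replace this bag-of-words overlap with deeper lexical or entity-based evidence, but the boundary-finding and evaluation set-up stays much the same.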
Barzilay, R. and Elhadad, M. (1997). Using Lexical Chains for Text Summarization. Summarization Workshop.
Hearst, M. (1997). TextTiling. Computational Linguistics.
Silber and McCoy (2002). An algorithm for lexical chains. Computational Linguistics.
Barzilay and Lapata (2008). Entity-based coherence. Computational Linguistics.
Testuggine (2012). Finding citation blocks by entity-based coherence. ACS thesis.
CBBC Newsround (www.bbc.co.uk) is a news site for children. It contains specially selected, very short news items whose text has been simplified for children. The news items are "real" in the sense that they correspond to current (real-time) news items on the "normal" BBC News site (or other sites).
The texts on CBBC Newsround are written by journalists. They are short, so some kind of summarisation is taking place. The sentences themselves are also shorter than those in texts aimed at adults, and lexical items have been paraphrased to lower the reading age of the texts (syntactic and lexical simplification). The idea of this project is to simulate one or several of the tasks by which an automatic process could generate such stories, applying standard summarisation methods, and to evaluate the generated texts against the human gold standards.
The summarisation algorithm would run on pairs of texts previously harvested from BBC News and CBBC Newsround. Whether or not this harvesting step is automated is up to the student. Depending on the method chosen below, the parallel corpus could be used for training purposes for step one below, and possibly for other steps as well.
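As a concrete starting point for the evaluation step, the sketch below computes ROUGE-1-style unigram recall and precision of a system-generated children's version against the corresponding human-written CBBC Newsround article. The pairing loop and the simplify_and_summarise call are placeholders for whatever corpus format and pipeline the student ends up building; the sketch only illustrates the comparison itself.

    import re
    from collections import Counter

    def unigrams(text):
        """Lower-cased unigram counts."""
        return Counter(re.findall(r"[a-z']+", text.lower()))

    def rouge1(generated, gold):
        """Return (recall, precision) of the generated unigrams against the gold text."""
        gen, ref = unigrams(generated), unigrams(gold)
        overlap = sum(min(gen[w], ref[w]) for w in ref if w in gen)
        recall = overlap / sum(ref.values()) if ref else 0.0
        precision = overlap / sum(gen.values()) if gen else 0.0
        return recall, precision

    # Hypothetical pairing of harvested (BBC News, CBBC Newsround) articles;
    # simplify_and_summarise stands in for the student's generation pipeline.
    # for adult_text, child_gold in corpus_pairs:
    #     print(rouge1(simplify_and_summarise(adult_text), child_gold))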
This project is really three projects in one, so the student will have to choose the direction according to their interests. An ideal student for this project would have ideas of their own, and be rather goal-driven, as several subtasks need to be solved and individually evaluated. The risk is higher than in the project above, but the novelty factor (particularly for the lexical substitution task) is also higher.
Kupiec et al. (1995). A trainable document summarizer. SIGIR '95.
Teufel and Moens (1997). Sentence Extraction as a Classification Task. ACL Summarisation Workshop.
Zajic, Dorr, Lin and Schwartz (2007). Multi-candidate reduction: Sentence compression as a tool for document summarization tasks. Information Processing & Management.
Siddharthan (2003). Preserving Discourse Structure when Simplifying Text. European Chapter of the Association for Computational Linguistics (EACL).
Yatskar, Pang, Danescu-Niculescu-Mizil and Lee (2010). For the sake of simplicity: Unsupervised extraction of lexical simplifications from Wikipedia. NAACL.
Project 3: A Summariser based on a model of human memory limitations

Description

Kintsch and van Dijk (1978), two psychologists, proposed a summarisation algorithm based on assumptions about the limitations of human memory. It simulates what the human brain would presumably do if a text were processed incrementally but only X propositions about the text could be kept in memory at any one time. As each new sentence is read, and depending on the state of the current buffer, new propositions can be collapsed with existing ones, replace existing ones, or be thrown away. The factors that decide which of these operations is applied rely on argument overlap, but also on semantic entailment and generalisation. Once the text has been processed, a set of propositions remains in memory; these constitute the summary and are then translated into fluent text.
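As a toy rendering of the buffer mechanism just described, the sketch below assumes propositions have already been extracted as (predicate, argument-set) tuples and uses only argument overlap -- no entailment, generalisation, or macro-rules -- to decide which propositions survive each processing cycle; the propositions retained longest are returned as the propositional summary. This is not a faithful implementation of the 1978 model, merely an illustration of the control structure.

    def overlap(prop, new_props):
        """Number of arguments a proposition shares with the incoming propositions."""
        new_args = set().union(*(args for _, args in new_props)) if new_props else set()
        return len(prop[1] & new_args)

    def summarise(sentences_as_props, capacity=4):
        """Toy Kintsch & van Dijk buffer: keep at most `capacity` propositions per
        cycle, preferring those best connected (by argument overlap) to the new sentence."""
        buffer, retention = [], {}
        for new_props in sentences_as_props:      # one list of propositions per sentence
            candidates = buffer + new_props
            candidates.sort(key=lambda p: overlap(p, new_props), reverse=True)
            buffer = list(dict.fromkeys(candidates))[:capacity]   # dedupe, keep best-connected
            for p in buffer:                      # count how long each proposition survives
                retention[p] = retention.get(p, 0) + 1
        # the most-retained propositions constitute the propositional summary
        return sorted(retention, key=retention.get, reverse=True)[:capacity]

    # Hand-crafted example; propositions are (predicate, frozenset of arguments):
    doc = [
        [("announce", frozenset({"government", "rules"}))],
        [("affect", frozenset({"rules", "schools"})),
         ("welcome", frozenset({"schools", "rules"}))],
        [("criticise", frozenset({"parents", "rules"}))],
    ]
    print(summarise(doc, capacity=2))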
The model was originally demonstrated by hand simulation and was well received in the 1980s, but it has been all but forgotten since then because it was thought, at the time, to be unimplementable. This is for two main reasons:
This highly innovative project is a feasibility study of whether current advances in lexical semantics and parsing are sufficient to automate this attractive, explanatory summarisation model.
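On the parsing side, the sketch below shows one crude way a modern dependency parser could supply the propositions such a model needs; spaCy and its en_core_web_sm model are assumed purely for illustration, and each verb together with its subject/object dependents is treated as one proposition. Real proposition extraction would need to be considerably more careful than this.

    import spacy

    nlp = spacy.load("en_core_web_sm")   # assumes the small English model is installed

    def crude_propositions(text):
        """Return (verb_lemma, frozenset of argument lemmas) tuples, one per verb."""
        props = []
        for token in nlp(text):
            if token.pos_ == "VERB":
                args = {child.lemma_.lower() for child in token.children
                        if child.dep_ in ("nsubj", "nsubjpass", "dobj", "iobj")}
                if args:
                    props.append((token.lemma_.lower(), frozenset(args)))
        return props

    print(crude_propositions("The government announced new rules and schools welcomed the decision."))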
Based on the problems mentioned above, several models can be built:
Evaluation of system output can then proceed in two ways:
The project is very ambitious and innovative, and could lead to an entirely new type of summariser that is more explanatory than those currently in use. However, the existence of the simple model as a fall-back makes this project less risky than it might initially look: even a successful implementation of the simple model alone would be extremely novel, and thus interesting to test against the state of the art.
Due to the speculative nature of the project, the student choosing it should be a competent programmer and problem solver who can work independently. Some annotation/data preparation will also have to be performed.
References:

Kintsch and van Dijk (1978). Toward a Model of Text Comprehension and Production. Psychological Review, 85(5). Hardcopy available from me upon request (in case you can't get to the library).
Bos and Markert (2005). Recognising textual entailment with logical inference. Proceedings of HLT '05.