Coherence in text -- the property of text to "glue together" in some sense -- is known to be based on lexical, syntactic, and discourse-based phenomena. This project tests different models of coherence to find regions of coherent text in news articles and, if time permits, in scientific discourse. It will test three core methods against each other:
Evaluation: following previous work, a "cheap" gold standard could be used, in which layout information is taken to indicate breaks in coherence. I would like to work with the student on a more convincing definition of ground truth, which should nevertheless still be determinable at least semi-automatically; i.e., this project does not necessarily require human judgement for evaluation.
This project would suit a student who has good intuition about writing style and is interested in algorithms (e.g., the lexical chain algorithm). It is an ambitious and work-heavy project of medium novelty (lots of implementation; programming language of choice). It is relatively low-risk, but if successful would be of interest to the general NLP community.
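One of the candidate methods, lexical-repetition scoring in the style of Hearst's TextTiling (referenced below), could be sketched as follows. This is a minimal illustration, not the full algorithm: it only computes a cohesion score at each inter-sentence gap, where low scores suggest candidate coherence breaks.

```python
from collections import Counter
import math

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words vectors."""
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def gap_scores(sentences, block_size=2):
    """Lexical-cohesion score at each gap between sentences:
    similarity of the block of sentences before the gap vs. the
    block after it. Low scores suggest a coherence break."""
    bags = [Counter(s.lower().split()) for s in sentences]
    scores = []
    for gap in range(1, len(bags)):
        left = sum(bags[max(0, gap - block_size):gap], Counter())
        right = sum(bags[gap:gap + block_size], Counter())
        scores.append(cosine(left, right))
    return scores
```

The real TextTiling algorithm additionally uses fixed-size token sequences rather than sentences, a stop-list, smoothing of the gap scores, and boundary placement at sufficiently deep local minima.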
Barzilay, R. and Elhadad, M. Using Lexical Chains for Text Summarization. 1997 Summarization Workshop.
Hearst, M. Text Tiling. Computational Linguistics. 1997.
Silber and McCoy, An algorithm for lexical chains, Computational Linguistics, 2002.
Barzilay and Lapata, Entity-based coherence, CL 2008.
Testuggine, Finding citation blocks by entity-based coherence, ACS thesis 2012.
CBBC Newsround (www.bbc.co.uk) is a news site for children. It contains specially selected, very short news items whose text has been simplified for children. The news items are "real" in the sense that they correspond to current (real-time) news items on the "normal" BBC News site (or other sites).
The texts on CBBC Newsround are written by journalists. They are short, so some kind of summarisation is taking place. The sentences themselves are also shorter than those in texts addressed to adults, and lexical items have been paraphrased to lower the reading age of the texts (syntactic and lexical simplification). The idea of this project is to simulate one or several of the tasks by which an automatic process could generate such stories, and to apply standard summarisation methods to evaluate the generated texts against the human gold standards.
The summarisation algorithm would run on pairs of texts previously harvested from BBC News and CBBC Newsround. Whether or not this harvesting step is automated is up to the student. Depending on the method chosen below, the parallel corpus could be used for training purposes for step one below, and possibly for other steps as well.
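The extractive step could start from a Kupiec-style trainable sentence scorer. The sketch below uses two of the classic features (sentence position and keyword overlap); the weights are made up for illustration, whereas Kupiec et al. learn feature weights from an aligned corpus with a naive Bayes classifier.

```python
from collections import Counter

def _toks(sentence):
    """Crude tokeniser: lowercase, strip surrounding punctuation."""
    return [w.strip(".,;:!?").lower() for w in sentence.split()]

def extract_summary(sentences, n=2):
    """Score each sentence by (a) position (earlier is better, a
    strong cue in news) and (b) overlap with the document's most
    frequent content words; return the top-n sentences in document
    order. The 0.5/0.5 weights are illustrative, not trained."""
    stop = {"the", "a", "an", "of", "to", "in", "and", "is", "are",
            "on", "was", "from", "will"}
    doc_counts = Counter(w for s in sentences for w in _toks(s)
                         if w not in stop)
    keywords = {w for w, _ in doc_counts.most_common(10)}

    def score(i, sentence):
        words = [w for w in _toks(sentence) if w not in stop]
        overlap = sum(w in keywords for w in words) / (len(words) or 1)
        position = 1.0 / (i + 1)  # lead bias
        return 0.5 * overlap + 0.5 * position

    ranked = sorted(range(len(sentences)),
                    key=lambda i: score(i, sentences[i]), reverse=True)
    return [sentences[i] for i in sorted(ranked[:n])]
```

With the parallel corpus, the made-up weights would be replaced by parameters estimated from which BBC News sentences survive into the Newsround version.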
This project is really three projects in one, so the student will have to choose the direction according to their interests. An ideal student for this project would have ideas of their own, and be rather goal-driven, as several subtasks need to be solved and individually evaluated. The risk is higher than in the project above, but the novelty factor (particularly for the lexical substitution task) is also higher.
Kupiec et al. (1995). A Trainable Document Summarizer. SIGIR 95.
Teufel and Moens (1997). Sentence Extraction as a Classification Task. ACL Summarisation Workshop 1997.
Zajic, Dorr, Lin and Schwartz (2007). Multi-candidate Reduction: Sentence Compression as a Tool for Document Summarization Tasks. Information Processing & Management.
Siddharthan (2003). Preserving Discourse Structure when Simplifying Text. European Chapter of the Association for Computational Linguistics (EACL).
Yatskar, Pang, Danescu-Niculescu-Mizil and Lee (2010). For the Sake of Simplicity: Unsupervised Extraction of Lexical Simplifications from Wikipedia. NAACL 2010.
Kintsch and van Dijk (1978), two psychologists, proposed a summarisation algorithm based on assumptions about human memory limitations. It simulates what the human brain would presumably do if text is processed incrementally but only X propositions about a text can be kept in memory. When each new sentence is read, depending on the state of the previous buffer, new propositions can be collapsed with existing ones, replace existing ones, or be thrown away. The factors that decide which of these operations is applied rely on argument overlap, but also on semantic entailment and generalisation. Once the text has been processed, a set of propositions remains in memory. These constitute the summary and are then translated into fluent text.
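Restricted to the "simple model" (argument overlap only, no entailment or generalisation), one processing cycle of the buffer could be sketched as follows. The representation of propositions as (predicate, argument-tuple) pairs and the retention rule are assumptions for illustration, not Kintsch and van Dijk's exact formulation.

```python
def kvd_cycle(memory, new_props, capacity=4):
    """One cycle of a (much simplified) Kintsch & van Dijk buffer:
    add the propositions of the new sentence, then keep only
    `capacity` propositions, preferring those whose arguments
    overlap with others (they carry the coherence of the text so
    far). Propositions are (predicate, args) pairs."""
    pool = memory + [p for p in new_props if p not in memory]

    def connectivity(prop):
        _, args = prop
        return sum(
            len(set(args) & set(other_args))
            for other_pred, other_args in pool
            if (other_pred, other_args) != prop
        )

    # Stable sort: on ties, older propositions are retained first.
    ranked = sorted(pool, key=connectivity, reverse=True)
    return ranked[:capacity]
```

Running this over a text sentence by sentence, the propositions that survive many cycles would form the raw material of the summary.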
The model was originally demonstrated by hand-simulation and was well received in the 1980s, but it has been all but forgotten since then because it was thought (at the time) to be unimplementable. This is for two main reasons:
This highly innovative project is a feasibility study of whether current advances in lexical semantics and parsing are enough to automate this attractive, explanatory summarization model.
Based on the problems mentioned above, several models can be built:
Evaluation of system output can then proceed in two ways:
The project is very ambitious and innovative, and could lead to an entirely new type of summarizer, one that is more explanatory than the summarizers currently in use. However, the existence of the simple model as a fall-back makes this project less risky than it might initially look: even a successful implementation of the simple model would be extremely novel, and thus interesting to test against the state of the art.
Due to the speculative nature of the project, the student choosing this project should be a competent programmer and problem solver who can work on a task independently. Some annotation/data preparation will have to be performed.
Kintsch and van Dijk (1978). Toward a Model of Text Comprehension and Production. Psychological Review, vol. 85, number 5. Hardcopy available from me upon request (in case you can't get to the library).
Bos and Markert (2005). Recognising Textual Entailment with Logical Inference. Proceedings of HLT '05.
This project tests different models of coherence to find regions of coherent text in scientific discourse, in particular two types of coherent regions:
This project would suit a student who is interested in algorithms
(e.g., the lexical chain algorithm), and who likes data work (e.g.,
looking through dozens of citation blocks, deciding where they start
and end). The student should have good intuition about writing style
in science, and be able to generalise over similarities in writing
style. Programming language of choice.
Barzilay, R. and Elhadad, M. Using Lexical Chains for Text Summarization. 1997 Summarization Workshop.
Hearst, M. Text Tiling. Computational Linguistics. 1997.
Silber and McCoy, An algorithm for lexical chains, Computational Linguistics, 2002.
Barzilay and Lapata, Entity-based coherence, ACL 2005.
This project concerns anaphora resolution in scientific text, with a particular focus on pronouns and demonstrative definite NPs. The project consists of the following stages:
This project will probably require some manual annotation from the student. Definite NP reference is particularly difficult in scientific discourse and will not be addressed unless particularly fast progress is achieved. A comparison to a baseline algorithm such as LingPipe coreference (which is not specialised to scientific literature) should be performed.
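An even simpler baseline, below LingPipe, is recency plus number agreement: resolve each pronoun to the nearest preceding NP with matching number. The input format and all names in this sketch are assumptions for illustration, not any library's API.

```python
def resolve(mentions):
    """mentions: list of (surface, kind, number) in document order,
    with kind in {"np", "pron"} and number in {"sg", "pl"}.
    Resolves each pronoun to the nearest preceding NP that agrees
    in number. Real resolvers add syntax, salience and, for
    scientific text, special handling of citations."""
    antecedents = {}
    seen = []  # preceding NPs, most recent last
    for i, (surface, kind, number) in enumerate(mentions):
        if kind == "pron":
            for j in range(len(seen) - 1, -1, -1):
                s, n = seen[j]
                if n == number:
                    antecedents[i] = s
                    break
        else:
            seen.append((surface, number))
    return antecedents
```

Even this baseline already resolves many pronouns in expository text correctly, which is why the comparison against it is informative.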
An ideal student for this project would be rather goal-driven, as several subtasks need to be solved and individually evaluated, and would be linguistically interested. Programming language of choice.
Kim and Webber (2006). Automatic reference resolution in astronomy articles. International CODATA Conference, Beijing.
Ge et al. (1998). A Statistical Approach to Anaphora Resolution. Proceedings of the Sixth Workshop on Very Large Corpora, COLING-ACL '98, Montreal, Canada.
Kaplan and Tokunaga (2009). A citation-based approach to summarisation. NLPIR4DL, ACL workshop, Singapore.
Sentence-based abstract-document alignment has great benefits for summarisation, as sentences in the abstract are often picked from the rest of the document. How many differences there are between such sentence pairs depends on individual writing style, but overall few changes are observed. The goal of this project is to explore different methods for performing this alignment. A small gold standard corpus of 80 aligned abstract--document pairs exists, which could be expanded with very little work from the student if necessary.
In general, it is hard to find good semantic similarity metrics for linguistic objects as short as sentences. Both the longest common substring algorithm (Teufel and Moens 1997) and vector space comparison (e.g., as in Marcu, 1999) are known to perform rather weakly; such simple algorithms can, however, be used as a baseline in this project.
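Such a baseline could be sketched as follows. This version uses the longest common subsequence of words, normalised by the shorter sentence; both choices are assumptions for illustration (Teufel and Moens use a related longest common substring measure).

```python
def lcs_similarity(s1: str, s2: str) -> float:
    """Longest common subsequence of words, normalised by the
    length of the shorter sentence; 1.0 means the shorter
    sentence's words all appear, in order, in the other."""
    a, b = s1.lower().split(), s2.lower().split()
    if not a or not b:
        return 0.0
    # Standard dynamic programme: dp[i][j] = LCS length of a[:i], b[:j].
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, wa in enumerate(a, 1):
        for j, wb in enumerate(b, 1):
            dp[i][j] = dp[i-1][j-1] + 1 if wa == wb else max(dp[i-1][j], dp[i][j-1])
    return dp[len(a)][len(b)] / min(len(a), len(b))
```

A weakness this makes obvious: the measure sees no similarity at all between paraphrases with no word overlap, which is exactly what the semantic-space methods in this project are meant to fix.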
There are two flavours of this project. The first is an (as close to exhaustive as possible) exploration of different variables in semantic spaces for this task: syntactic vs. keyword-based, and dimensionality reduction methods such as LSI. The second option is to implement only ONE promising semantic space, and to combine this measure of semantic similarity with two additional facts about abstract--document alignment:
The student best suited for this project should be a competent programmer and very systematic (for option 1), or mathematically interested with an interest in developing new algorithms (for option 2).
Teufel, S. and Moens, M. (1997). A gold standard for abstract-document alignment. ACL-workshop on Automatic Summarization.
Marcu (1999). The automatic construction of large-scale corpora for summarization research. SIGIR-99.
Project 3: A Summariser based on a model of human memory limitations
Description
References:
2011 Project: Coherence in Scientific Discourse
Description
This project is to explore different coherence-based approaches to finding the two types of blocks. Options are coherence by lexical chains (Silber and McCoy 2002), coherence by lexical repetition (Hearst 1997) and entity-based coherence (Barzilay and Lapata 2005). These approaches have been successful in news texts. The project involves implementing at least two of the mentioned approaches for at least one of the types of blocks.
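For the entity-based option, the core data structure is the entity grid of Barzilay and Lapata: entities in rows, sentences in columns, with each cell recording the entity's syntactic role. Assuming pre-extracted (entity, role) information per sentence rather than full parsing, the grid and its role-transition distribution might be built as:

```python
from collections import Counter
from itertools import product

def entity_grid(sentences):
    """sentences: list of {entity: role} dicts, role in {"S","O","X"}.
    Returns the grid as {entity: [role or "-" per sentence]}."""
    entities = {e for sent in sentences for e in sent}
    return {e: [sent.get(e, "-") for sent in sentences] for e in entities}

def transition_probs(grid):
    """Distribution over role bigrams (S->S, S->-, ...) read down
    each entity's column; in Barzilay and Lapata's model this
    vector is the coherence representation of the text."""
    counts = Counter()
    for roles in grid.values():
        for a, b in zip(roles, roles[1:]):
            counts[(a, b)] += 1
    total = sum(counts.values()) or 1
    return {t: counts[t] / total for t in product("SOX-", repeat=2)}
```

Coherent regions tend to show dense S/O transitions for a few entities, while a block boundary shows up as columns full of "-".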
Citation blocks might be a promising starting point, as the algorithms can be run on an existing,
citation-parsed corpus of around 16,000 scientific texts in one area. An evaluation method of choice is to be used to determine how well the
algorithms perform relative to each other, and to a
baseline. Evaluation possibilities are a) a gold-standard evaluation,
which means that the student performs some annotation [some annotated material for citation blocks exists] or b) a human
evaluation study, where human subjects are asked if they agree with
the system's boundaries.
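For the gold-standard option, boundary agreement is usually scored with a window-based metric rather than exact boundary match, so that near-misses are penalised less than complete misses. A sketch of the WindowDiff metric of Pevzner and Hearst, assuming segmentations are encoded as 0/1 boundary vectors (one value per gap between sentences):

```python
def window_diff(reference, hypothesis, k=None):
    """reference, hypothesis: sequences of 0/1, one per gap
    between sentences (1 = boundary). Slides a window of size k
    and counts positions where the two disagree on the number of
    boundaries inside the window; 0.0 is perfect, 1.0 is worst."""
    assert len(reference) == len(hypothesis)
    if k is None:
        # Conventional choice: half the mean reference segment length.
        n_segments = sum(reference) + 1
        k = max(1, round((len(reference) + 1) / (2 * n_segments)))
    errors = 0
    spans = len(reference) - k + 1
    for i in range(spans):
        if sum(reference[i:i+k]) != sum(hypothesis[i:i+k]):
            errors += 1
    return errors / spans if spans > 0 else 0.0
```

This would let the two implemented coherence algorithms be compared on the same footing, against each other and against a degenerate baseline (e.g., no boundaries, or boundaries at fixed intervals).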
References:
2011 Project: An Anaphora resolver for Scientific Discourse
Description
A partial solution to the scientific anaphora resolution problem has been presented in the context of reference to citations, by Kim and Webber (2006), who treat only "they" with supervised ML, and by Kaplan and Tokunaga (2009), who use out-of-the-box anaphora resolution.
References:
2011 Project: A Comparison of Semantic Spaces for Abstract-Document Alignment
Description
Option 2 turns the best choice of alignment into a constraint satisfaction problem.
References: