Project 1: A Coherence Checker

  • Proposer: Simone Teufel
  • Supervisor: Simone Teufel
  • Special Resources: None

    Description

    Coherence in text -- the property of text to "glue together" in some sense -- is known to be based on lexical, syntactic, and discourse-based phenomena. This project tests different models of coherence in order to find regions of coherent text in news articles and, if time permits, in scientific discourse. It will test three core methods against each other:

    A very similar project last year (Testuggine 2012) found the entity-based model to outperform lexical similarity on the task of finding related-work sections in text. This project would build upon that work by a) moving to a "simpler", more clear-cut domain (newspaper text). Intellectually, the largest impact of this project lies in devising a method for scoring, rather than ranking, coherence.
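    To make the lexical-similarity side of this comparison concrete, the sketch below (an illustration only, not part of the proposal; the function names, the toy stopword list and the example sentences are placeholders) scores a region of text by the average cosine similarity of adjacent sentences over their content-word counts, roughly in the spirit of TextTiling's block comparison:

      import math
      import re
      from collections import Counter

      STOPWORDS = {"the", "a", "an", "of", "to", "in", "and", "is", "was", "for"}  # toy stopword list

      def content_words(sentence):
          # lowercase alphabetic tokens minus stopwords; a crude stand-in for lemmatisation
          return [w for w in re.findall(r"[a-z]+", sentence.lower()) if w not in STOPWORDS]

      def cosine(c1, c2):
          # cosine similarity between two word-count vectors
          num = sum(c1[w] * c2[w] for w in set(c1) & set(c2))
          denom = math.sqrt(sum(v * v for v in c1.values())) * math.sqrt(sum(v * v for v in c2.values()))
          return num / denom if denom else 0.0

      def coherence_score(sentences):
          # score a region as the mean lexical similarity of adjacent sentence pairs
          vecs = [Counter(content_words(s)) for s in sentences]
          sims = [cosine(vecs[i], vecs[i + 1]) for i in range(len(vecs) - 1)]
          return sum(sims) / len(sims) if sims else 0.0

      region = ["The president met union leaders on Tuesday.",
                "The leaders asked the president for a new wage deal.",
                "Meanwhile, the stock market fell sharply."]
      print(coherence_score(region))  # higher values suggest a more lexically coherent region

    A lexical-chain or entity-based scorer could slot in as an alternative implementation of coherence_score, so that the different methods can be compared on the same regions.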

    Evaluation: following previous work, a "cheap" gold standard could be used, in which breaks in coherence are inferred from layout information. I would like to work with the student on a more convincing definition of ground truth, which should, however, still be determinable at least semi-automatically; i.e., this project does not necessarily require human judgement for its evaluation.
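    One way such a layout-derived gold standard could be operationalised (a sketch under assumed conventions: break positions are sentence indices and a small tolerance window is allowed; stricter segmentation metrics such as Pk or WindowDiff would be alternatives) is:

      def break_eval(predicted, gold, tolerance=1):
          # precision/recall of predicted coherence breaks against layout-derived breaks;
          # a prediction within `tolerance` sentences of an unmatched gold break counts as a hit
          matched, hits = set(), 0
          for p in predicted:
              for g in gold:
                  if g not in matched and abs(p - g) <= tolerance:
                      matched.add(g)
                      hits += 1
                      break
          precision = hits / len(predicted) if predicted else 0.0
          recall = hits / len(gold) if gold else 0.0
          return precision, recall

      # e.g. paragraph boundaries from the article layout as gold breaks
      print(break_eval(predicted=[3, 9, 15], gold=[4, 10, 20]))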

    This project would suit a student who has good intuition about writing style and is interested in algorithms (e.g., the lexical chain algorithm). It is an ambitious and work-heavy project of medium novelty (lots of implementation; programming language of choice). It is relatively low-risk, but if successful would be of interest to the general NLP community.

    References:

    Barzilay, R. and Elhadad, M. (1997). Using Lexical Chains for Text Summarization. ACL Summarization Workshop.

    Hearst, M. (1997). TextTiling: Segmenting Text into Multi-paragraph Subtopic Passages. Computational Linguistics.

    Silber and McCoy (2002). An Algorithm for Lexical Chains. Computational Linguistics.

    Barzilay and Lapata (2008). Modeling Local Coherence: An Entity-Based Approach. Computational Linguistics.

    Testuggine (2012). Finding Citation Blocks by Entity-Based Coherence. ACS thesis.

    Project 2: A News Summariser for Children

  • Proposer: Simone Teufel
  • Supervisor: Simone Teufel
  • Special Resources: None

    Description

    CBBC Newsround (www.bbc.co.uk) is a news site for children. It contains specially selected, very short news items whose text has been simplified for children. The news items are "real" in the sense that they correspond to current (real-time) news items on the "normal" BBC News site (or other sites).

    The texts on CBBC Newsround are written by journalists. They are short, so some kind of summarisation is taking place. The sentences themselves are also shorter than those in texts addressed to adults, and lexical items have been paraphrased to lower the reading age of the texts (syntactic and lexical simplification). The idea of this project is to simulate one or several of the tasks by which an automatic process could generate such stories, and to evaluate the generated texts against the human gold standards using standard summarisation evaluation methods.

    The summarisation algorithm would run on pairs of texts previously harvested from BBC News and CBBC Newsround. Whether or not this harvesting step is automated is up to the student. Depending on the method chosen below, the parallel corpus could be used for training purposes for step one below, and possibly for other steps as well.
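    As an illustration of how the parallel corpus might be turned into training data for an extractive step (a sketch only; the alignment criterion, the threshold and the function names are assumptions, not part of the proposal), each BBC sentence could be labelled as a positive example when its unigram overlap with the Newsround version is high enough, in the spirit of aligning abstract sentences to source sentences as in Kupiec et al. (1995):

      import re
      from collections import Counter

      def unigrams(text):
          # bag of lowercase alphabetic tokens
          return Counter(re.findall(r"[a-z]+", text.lower()))

      def overlap(sentence, target_counts):
          # fraction of the sentence's tokens that also occur in the simplified article
          counts = unigrams(sentence)
          shared = sum(min(c, target_counts[w]) for w, c in counts.items())
          total = sum(counts.values())
          return shared / total if total else 0.0

      def label_source_sentences(bbc_sentences, newsround_text, threshold=0.5):
          # mark each BBC sentence as extraction-worthy if its overlap with the
          # Newsround article exceeds an (assumed) threshold
          target = unigrams(newsround_text)
          return [(s, overlap(s, target) >= threshold) for s in bbc_sentences]

    The labelled sentences could then be used to train a standard sentence-extraction classifier.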

    The next steps are the intellectual core of this project. Various directions are possible, from lexical similarity detection to syntactic simplification:

    Evaluation will be performed using standard unigram-based summarisation evaluation. Human evaluation is possible (and would add value to the project), but it is not strictly required because gold-standard texts are available.
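    The unigram-based evaluation could look roughly like the following (a minimal sketch assuming a ROUGE-1-style recall measure without stemming or stopword handling; the function name is illustrative):

      import re
      from collections import Counter

      def unigram_recall(system_summary, gold_summary):
          # fraction of gold-summary unigrams that the system summary covers
          sys_counts = Counter(re.findall(r"[a-z]+", system_summary.lower()))
          gold_counts = Counter(re.findall(r"[a-z]+", gold_summary.lower()))
          matched = sum(min(c, sys_counts[w]) for w, c in gold_counts.items())
          total = sum(gold_counts.values())
          return matched / total if total else 0.0

      print(unigram_recall("the cat sat on the mat", "a cat sat on a mat"))  # 4 of 6 gold tokens covered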

    This project is really three projects in one, so the student will have to choose the direction according to their interests. An ideal student for this project would have ideas of their own, and be rather goal-driven, as several subtasks need to be solved and individually evaluated. The risk is higher than in the project above, but the novelty factor (particularly for the lexical substitution task) is also higher.

    References:

    Kupiec et al. (1995). A Trainable Document Summarizer. SIGIR '95.

    Teufel, S. and Moens, M. (1997). Sentence Extraction as a Classification Task. ACL Summarisation Workshop 1997.

    Zajic, D., Dorr, B., Lin, J. and Schwartz, R. (2007). Multi-Candidate Reduction: Sentence Compression as a Tool for Document Summarization Tasks. Information Processing & Management.

    Siddharthan, A. (2003). Preserving Discourse Structure when Simplifying Text. European Chapter of the Association for Computational Linguistics (EACL).

    Yatskar, M., Pang, B., Danescu-Niculescu-Mizil, C. and Lee, L. (2010). For the Sake of Simplicity: Unsupervised Extraction of Lexical Simplifications from Wikipedia. NAACL 2010.

    Project 3: A Summariser based on a model of human memory limitations

  • Proposer: Simone Teufel
  • Supervisor: Simone Teufel
  • Special Resources: None

    Description

    Kintsch and van Dijk (1978), two psychologists, proposed a summarisation algorithm based on assumptions about human memory limitations. It simulates what the human brain would presumably do if text is processed incrementally but only X propositions about the text can be kept in memory at any one time. When each new sentence is read, depending on the state of the existing queue, new propositions can be collapsed with existing ones, replace existing ones, or be thrown away. The factors that decide which of these operations is applied rely on argument overlap, but also on semantic entailment and generalisation. Once the text has been processed, a set of propositions remains in memory; these constitute the summary and are then translated into fluent text.
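    As a rough illustration of the memory-buffer mechanism (simplifying heavily: propositions are reduced to predicate-argument tuples and only the argument-overlap criterion is modelled; entailment, generalisation and collapsing are omitted, and all names are placeholders rather than the proposal's design), the skeleton of the "simple" model might look like this:

      from collections import namedtuple

      Proposition = namedtuple("Proposition", ["predicate", "args"])

      def argument_overlap(p, buffer):
          # number of buffered propositions sharing at least one argument with p
          return sum(1 for q in buffer if set(p.args) & set(q.args))

      def process_text(sentences_as_propositions, capacity=4):
          # incrementally process the text, keeping at most `capacity` propositions;
          # after each sentence, the propositions best connected to the current buffer
          # (by argument overlap) are retained and the rest are dropped
          buffer = []
          for props in sentences_as_propositions:
              candidates = buffer + list(props)
              candidates.sort(key=lambda p: argument_overlap(p, buffer), reverse=True)
              buffer = candidates[:capacity]
          return buffer  # the propositions left in memory form the basis of the summary

      text = [
          [Proposition("meet", ("president", "leaders")),
           Proposition("on", ("meet", "tuesday"))],
          [Proposition("ask", ("leaders", "president", "deal")),
           Proposition("new", ("deal",))],
      ]
      print(process_text(text, capacity=3))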

    The model was originally demonstrated by hand simulation and was well received in the 1980s, but it has been all but forgotten since, because it was thought to be unimplementable at the time. This is for two main reasons:

    This highly innovative project is a feasibility study of whether current advances in lexical semantics and parsing are sufficient to automate this attractive, explanatory summarisation model. Based on the problems mentioned above, several models can be built:

    Evaluation of system output can then proceed in two ways:

    The project is very ambitious and innovative, and it could lead to an entirely new type of summariser which is more explanatory than the summarisers currently in use. However, the existence of the simple model as a fall-back makes this project less risky than it might initially look: even a successful implementation of the simple model would be extremely novel, and thus interesting to test against the state of the art.

    Due to the speculative nature of the project, the student choosing it should be a competent programmer and problem solver who can work on a task independently. Some annotation/data preparation will have to be performed.

    References:

    Kintsch, W. and van Dijk, T. A. (1978). Toward a Model of Text Comprehension and Production. Psychological Review, 85(5). Hardcopy available from me upon request (in case you can't get to the library).

    Bos, J. and Markert, K. (2005). Recognising Textual Entailment with Logical Inference. Proceedings of HLT '05.