ACS student project suggestions 2015/16 -- Simone Teufel
Project 1: Sentence Alignment for Summarisation
Project Description
This project provides an important prerequisite for supervised machine learning
methods, namely an alignment of sentences from the abstract with
sentences from the document body of scientific articles.
One form of state-of-the-art summarisation relies on supervised
machine learning of lexical statistics from pairs of sentences from
abstracts and sentences from documents. This project develops a method
of determining the best alignment of such sentence pairs, choosing
from a large set of models of semantic similarity.
The project combines distributional models of sentence similarity with
discourse-derived information to find the best overlap of sentences.
Marcu (1999) proposes a clause-based method of determining where in a
text an abstract clause comes from, based on cosine similarity.
The project will reimplement the Marcu model, but test replacing the cosine
similarity with the Longest Common Substring and with various
distributional models of clause similarity.
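As a point of orientation, here is a minimal sketch of the cosine-similarity
starting point, assuming abstract and document sentences are already available
as lists of strings; the TF-IDF weighting and the greedy one-best mapping are
illustrative assumptions, not the project's required design.

# Minimal sketch: align abstract sentences to document sentences by
# TF-IDF cosine similarity (a stand-in for the cosine model in Marcu 1999).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def align(abstract_sentences, document_sentences):
    vectoriser = TfidfVectorizer(lowercase=True, stop_words="english")
    # Fit on all sentences so both sides share one vocabulary.
    matrix = vectoriser.fit_transform(abstract_sentences + document_sentences)
    abs_vecs = matrix[:len(abstract_sentences)]
    doc_vecs = matrix[len(abstract_sentences):]
    sims = cosine_similarity(abs_vecs, doc_vecs)
    # Greedy one-best alignment: each abstract sentence maps to the
    # document sentence with the highest similarity score.
    return [(i, int(sims[i].argmax()), float(sims[i].max()))
            for i in range(len(abstract_sentences))]

abstract = ["Dung beetles recycle cattle dung quickly."]
document = ["Some beetles spend their lives eating and breeding in dung.",
            "The beetles bury cow pats within days."]
print(align(abstract, document))

Replacing the cosine_similarity call with Longest Common Substring or a
distributional similarity measure is then a local change to this alignment step.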
For 80 papers, abstract sentences have been manually aligned with
sentences in the document body (but not at clause level). This corpus
can be used for evaluation.
Literature
- Daniel Marcu (1999). The automatic construction of large-scale corpora
  for summarization research. In Proceedings of the 22nd International ACM
  SIGIR Conference on Research and Development in Information Retrieval
  (SIGIR'99), pages 137-144, Berkeley, CA, August 1999.
Project 2: Automatic Model of Scientific Argumentation
Project Description
Scientific argumentation can be seen as a sequence of speech acts
operating in an argumentation game, where the highest-level goal is
the justification of the current paper. Intermediate goals are the
sub-argument that the work presented is novel, or that the work
presented constitutes an improvement over existing work. Successful
recognition of these speech acts allows the higher-level intentions,
and eventually the overall goal, to be recovered.
We are particularly interested in one kind of speech act here: those
involving mentions of other work -- citations, author names, names of
approaches associated with particular authors, possessive and personal
pronouns, and statements about the relationship between the mentioned
work and the current paper. For each such noun phrase, we need the
following factors:
- "grounding" in terms of a link to one or more of the citations
listed at the end of the paper
- a probability that expresses the certainty that the noun phrase
encountered in a given sentence actually does refer to that citation.
- a classification as one of the 23 listed speech acts from
Teufel (2010) in which the noun phrase participates
- a probability that expresses the certainty that the speech act
was indeed found in the paper.
Evaluation will compare sentences that have been pre-annotated (known
to contain moves involving existing work) with the system's first
choice for each noun phrase. 200,000 sentences pre-annotated at the
sentence level exist, but the student may have to do some annotation
him/herself.
The student choosing this project will have to make their own choices
with respect to the machine learning algorithm, and creatively develop
an algorithm for a new task definition which is located between named
entity recognition, citation classification and coreference resolution.
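As a rough illustration of one possible starting point (not a prescribed
design), the sketch below classifies candidate noun phrases with a few shallow
surface features and a logistic regression model whose predicted probabilities
could serve as the certainty scores described above; the feature set and the
class labels are placeholders, not the actual 23 speech acts from Teufel (2010).

# Minimal sketch: classify noun-phrase mentions of other work and return
# a probability for each class. Features and labels are illustrative only.
import re
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

def np_features(noun_phrase, sentence):
    return {
        "head": noun_phrase.split()[-1].lower(),
        "has_citation": bool(re.search(r"\(\s*\w+[^)]*\d{4}\s*\)", noun_phrase)),
        "is_pronoun": noun_phrase.lower() in {"they", "their", "we", "our", "it"},
        "sent_has_contrast": any(w in sentence.lower()
                                 for w in ("however", "unlike", "in contrast")),
    }

# Toy training data (placeholder): (noun phrase, containing sentence, label).
train = [
    ("Marcu (1999)", "Marcu (1999) proposes a clause-based method.", "NEUTRAL_MENTION"),
    ("their approach", "Unlike their approach, we use full parsing.", "CONTRAST"),
]
X = [np_features(np_, sent) for np_, sent, _ in train]
y = [label for _, _, label in train]

model = make_pipeline(DictVectorizer(), LogisticRegression(max_iter=1000))
model.fit(X, y)

probs = model.predict_proba(
    [np_features("the Marcu model", "We reimplement the Marcu model.")])[0]
print(dict(zip(model.classes_, probs)))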
Literature
- Siddharthan and Teufel (2007). Whose idea was this, and why does it
  matter? Attributing scientific work to citations. In Proceedings of the
  Annual Conference of the North American Chapter of the Association for
  Computational Linguistics (NAACL-HLT 2007), pages 316-323, Rochester,
  New York, USA, 22-27 April.
Project 3: Automatic Identification of Creativity and
Innovativeness in Scientific Writing
Project Description
This project proposes the development of an indicator of innovativeness, in
order to improve bibliometric assessment of science. Bibliometrics is
the quantitative assessment of the research output of researchers or
universities, as used for instance in the UK's Research Excellence
Framework [1]. A related task is IARPA's FUSE program [2], which seeks
to detect emerging opportunities in science and technology as early as
possible. Its fundamental hypothesis is that real-world processes of
technical emergence leave discernible traces in the public scientific,
technical and patent literature. Most of the current science
indicators are citation-based. The degree of innovativeness of a
paper is an aspect of emergence that is closely related to this idea.
It is commonly believed that high impact papers are
innovative. However, some highly cited papers are conforming: they
document incremental research and tend to reinforce the status quo [3].
Innovativeness can therefore not be assessed purely by looking at
citation counts. One can try to approach the problem of identifying
innovative scientific papers using citation networks [3, 4]. This
approach is based on the idea that innovative papers maximally disrupt
the existing citation structure of the topic.
It has also long been assumed that access to full text would result in
better innovation finding. This is exemplified by the related problem of
identifying "paradigm shifts" [5]. The current project follows along
this research avenue, and attempts to add information about sentences
such as the following to the search for innovativeness:
  This result challenges the claims of recent discourse theories
  (Grosz and Sidner 1986; Reichman 1985) which argue for a close
  relation between cue words and discourse structure.
Our US collaborators Richard Klavans and Kevin Boyack have
completed a survey in which highly influential biomedical scientists
rated 10 of their highly cited papers. Despite the deeply subjective
nature of innovativeness, the authors themselves are certainly in the
best position to assess how innovative their own papers are, provided the
self-elicitation is performed in an honest, trusted manner, where the
reputation of the informant is not threatened. Klavans and Boyack
achieved this by asking only about those papers which are
high-impact anyway. The data from this survey allows us to classify
the 1200 papers as being innovative, progressive, or mediocre.
The methodology followed in this project relies on performing
Argumentative Zoning (AZ) classification on the full-text corpus of the
1200 papers, and then finding a correlation between the
rhetorical "footprint" of a paper (derived via AZ) and its level of
innovativeness. The rhetorical footprint will be based on AZ-derived
features, which are fed into a machine learning system that correlates
these features with the papers' innovativeness status.
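To illustrate what the footprint-to-label step could look like, the following
is a minimal sketch which assumes each paper's sentences have already been
labelled with AZ categories; the choice of zone proportions as features follows
the description above, while the zone inventory, data and classifier are
illustrative placeholders.

# Minimal sketch: turn per-sentence AZ labels into a "rhetorical footprint"
# (proportion of each zone) and fit a classifier against the survey labels
# (innovative / progressive / mediocre). Data here are dummy placeholders.
from collections import Counter
from sklearn.linear_model import LogisticRegression

ZONES = ["AIM", "OWN", "BACKGROUND", "CONTRAST", "BASIS", "OTHER", "TEXTUAL"]

def footprint(sentence_zones):
    # Proportion of sentences falling into each AZ category.
    counts = Counter(sentence_zones)
    total = max(len(sentence_zones), 1)
    return [counts[z] / total for z in ZONES]

# Placeholder training data: (per-sentence zone labels, survey label).
papers = [
    (["AIM", "OWN", "CONTRAST", "OWN"], "innovative"),
    (["BACKGROUND", "OWN", "OWN", "BASIS"], "progressive"),
    (["BACKGROUND", "BACKGROUND", "OWN", "OWN"], "mediocre"),
]
X = [footprint(zones) for zones, _ in papers]
y = [label for _, label in papers]

clf = LogisticRegression(max_iter=1000).fit(X, y)
print(clf.predict([footprint(["AIM", "CONTRAST", "OWN"])]))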
Practicalities
Most of the corpus is already acquired in full text and has been
transformed into a uniform XML format (SciXML). The first step of this
project is to unify the rest of the corpus into SciXML. The existing
implementation of Argumentative Zoning ([6,7] cf. other project
descriptions) can then be run on the new medical corpus.
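For orientation, a minimal sketch of reading sentences out of a SciXML file is
given below; the sentence element name used here (S) is an assumption based on
common SciXML conventions and should be checked against the actual corpus
schema.

# Minimal sketch: pull sentence strings out of a SciXML document.
# The element name "S" is an assumption to be verified against the corpus.
import xml.etree.ElementTree as ET

def scixml_sentences(path):
    tree = ET.parse(path)
    root = tree.getroot()
    # Collect the text of every sentence element, wherever it occurs.
    return ["".join(s.itertext()).strip() for s in root.iter("S")]

for sentence in scixml_sentences("paper.xml"):
    print(sentence)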
AZ currently relies on supervised machine learning. It has been
trained on annotated articles from computational linguistics and
chemistry [7]. However, the corpus we will use for learning
innovativeness contains articles from biomedical science. An early stage
of this project will therefore assess whether the classification of
the AZ system trained on chemistry and CL is adequate for helping in
the innovativeness classification, or whether its lexical resources
need to be manually adapted.
Literature
- [1] Research Excellence Framework (REF).
  http://www.ref.ac.uk/background/bibliometrics/.
- [2] D. A. Murdick, Foresight and understanding from scientific
  exposition (FUSE). http://www.iarpa.gov/Programs/ia/FUSE/fuse.html.
- [3] R. Klavans, K. W. Boyack, A. A. Sorensen, and C. Chen, Towards
  the development of an indicator of conformity.
- [4] C. Chen, Y. Chen, M. Horowitz, H. Hou, Z. Liu, and
  D. Pellegrino, Towards an explanatory and computational theory of
  scientific discovery, Journal of Informetrics, vol. 3, no. 3,
  pp. 191-209, 2009.
- [5] F. Lisacek, C. Chichester, A. Kaplan, and A. Sandor,
  Discovering paradigm shift patterns in biomedical abstracts:
  application to neurodegenerative diseases, in: First International
  Symposium on Semantic Mining in Biomedicine, pp. 11-13, Citeseer,
  2005.
- [6] S. Teufel and M. Moens, Summarizing scientific articles:
  experiments with relevance and rhetorical status, Computational
  Linguistics, vol. 28, no. 4, pp. 409-445, 2002.
- [7] S. Teufel, A. Siddharthan, and C. Batchelor, Towards
  discipline-independent argumentative zoning: Evidence from chemistry
  and computational linguistics, in: Proceedings of the 2009 Conference
  on Empirical Methods in Natural Language Processing: Volume 3,
  pp. 1493-1502, Association for Computational Linguistics, 2009.
Project 4: Improving the Output of a Proposition-based Summariser
Project Originator
Simone Teufel
Project Supervisor
Simone Teufel and Yimai Fang
Project Description
This project aims to create more grammatical output for an existing
prototype summariser which is based on propositions -- shallow semantic
representations which the summariser uses to build a simple discourse
model. On the basis of this model, it can decide which propositions are
the text's most important ones.
And this is where the trouble begins. What this project addresses
is what happens once these propositions have been chosen, not the main
mechanism of the summariser. The current, clearly suboptimal, output
solution is to print the lexical part of the selected propositions in
text order. This produces texts such as:
Some of the beetles, which spend their lives eating and breeding in
dung. 4,000 species have evolved to climates the dung of
animals. The soft cattle dung in which flies flies
breed. Dung-breeding flies. A time, into cow pats.
This text was produced by simple extraction from propositions such as
(ranked by importance):
- 6(1.13) spend (which; lives, eating)
- 12(1.21) in (eating; dung)
- 83(7.24) into (a time; cow pats)
- 11(1.19) and (eating, breeding)
- 9(1.15+) POSSESS (their; lives)
- 21(2.22) of (the dung; animals)
- 15(2.10) have evolved (4,000 species; to: climates, to: the dung)
- 1(1.2) of (Some; the beetles)
The summariser is very good at handling propositions and beats other
summarisers on its content extraction ability, but for human
consumption the output is too rough.
One might ask why we choose to produce output at the proposition
level, if it results in texts of such low quality, rather
than extracting sentences, which is obviously much easier. The reason is
that the proposition-level unit gives our summariser an edge over
sentence extractors, which cannot bundle information as tightly as
this summariser can. However, there are many ways in which this can be
done better than by lexeme extraction. The task of the student who
chooses this project is to make the output smoother.
The overall structure of this program will be an
overgenerate-and-rank model.
This project will use a subcategorisation lexicon and n-gram
methods of shallow generation to produce sentences from the
propositions which:
- are grammatical (or at least more grammatical than the current
solution); for this we will use a subcategorisation lexicon (Korhonen et
al. 2006)
- read naturally without distorting the meaning of the original
text; variations of shallow generation can be used for this, e.g.,
n-gram models (Langkilde and Knight 2002) or a knapsack generation
algorithm (Nishikawa et al. 2014); a rough sketch of the n-gram
ranking idea is given below.
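The following is a minimal sketch of the overgenerate-and-rank idea under very
strong simplifications: candidate realisations are just permutations of a
proposition's lexical chunks, and the ranker is a tiny add-one-smoothed bigram
model estimated from a placeholder reference text. Both choices are
illustrative assumptions, not the project's intended solution.

# Minimal sketch of overgenerate-and-rank: enumerate candidate orderings of
# a proposition's lexical chunks and rank them with a toy bigram model.
from collections import Counter
from itertools import permutations
from math import log

reference = "some of the beetles spend their lives eating and breeding in dung".split()
unigrams = Counter(reference)
bigrams = Counter(zip(reference, reference[1:]))

def score(tokens):
    # Add-one smoothed bigram log-probability of a candidate realisation.
    total = 0.0
    for prev, curr in zip(tokens, tokens[1:]):
        total += log((bigrams[(prev, curr)] + 1) / (unigrams[prev] + len(unigrams)))
    return total

def realise(chunks):
    # Overgenerate: all orderings of the chunks; rank: best bigram score.
    candidates = [" ".join(p) for p in permutations(chunks)]
    return max(candidates, key=lambda c: score(c.split()))

print(realise(["the beetles", "spend", "their lives", "eating and breeding in dung"]))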
Literature
- Y. Fang and S. Teufel (2014). A summariser based on human memory
  limitations and lexical competition. In Proceedings of EACL 2014,
  Gothenburg, Sweden.
- Anna Korhonen, Yuval Krymolowski and Ted Briscoe (2006). A Large
  Subcategorization Lexicon for Natural Language Processing
  Applications. In Proceedings of the 5th International Conference on
  Language Resources and Evaluation (LREC 2006), Genova, Italy.
- H. Nishikawa, K. Arita, K. Tanaka, T. Hirao, T. Makino and Y. Matsuo
  (2014). Learning to Generate Coherent Summary with Discriminative
  Hidden Semi-Markov Model. In Proceedings of the 25th International
  Conference on Computational Linguistics (COLING 2014).
Project 5: Is Fido really sick? Sequence learning applied to Disease
Indicators in a veterinary context
Project Originator
Noel Kennedy
Project Supervisor
Simone Teufel and Noel Kennedy
Project Description
This project will improve an information retrieval system that is
currently in use at the Royal Veterinary College in London.
The IR system indexes the clinical data in
the VetCompass project.
VetCompass is a not-for-profit organisation which seeks to improve
animal welfare by improving the understanding of animal
diseases. VetCompass holds clinical records for 4 million animals and the IR
system indexes around 130 million documents. This project addresses a key
problem in clinical research (including human clinical research). In
contrast to human medical data, access to veterinary data is orders of
magnitude easier and cheaper.
The particular problem addressed is the following: a vet is searching
for cases of patients that have a certain disease, and enters
variations of the disease name, e.g. 'diabetes' or 'dm'. If a case note
(document) contains a mention of a particular disease, there is only a
33% chance that the patient actually has that disease, i.e., the False
Positive (FP) rate is 67%. In those cases, the
disease appears in a negated, hypothetical or attributed
context. The True Positive (TP) rate varies among different diseases
and ranges from 12% to 63%. This project will collect evidence about
disease references, in particular the sequence in which related
evidence occurs, in order to improve this situation and lower the FP rate.
The vets only look at documents which contain at least one
disease-relevant token. They use domain knowledge to interpret at
least one whole document per patient. They are looking for enough
cumulative evidence from multiple sentences, which they use to
determine if the patient meets their criteria or not. The researcher
makes the classification decision, for each patient, at potentially
two different points:
- A positive decision is made after reading enough relevant
documents with enough positive evidence that warrants classifying the
patient as a positive case. This point is typically reached part-way
through all the relevant documents for each patient.
- A negative decision can only be reached after the last relevant
document was read and there wasn't enough evidence to make a
positive decision. They typically have to read all the relevant
documents to make a negative decision. For this reason, a high false
positive rate is a big problem.
This project attacks the problem as a sequence classification
task. The sequence is a stream of tokens (probably with sentence and
document termination tokens). The machine would need to learn to
differentiate positive and negative sequences. Dai and Le (2015)
trained a sequence autoencoder for a similar task, used it to
initialise a sequence classifier, and found that this pre-training
improved the performance of the classifier.
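To make the task definition concrete, here is a minimal sketch of a
token-stream sequence classifier in the spirit of the Dai and Le setup, but
without the autoencoder pre-training or any active learning component; the
vocabulary size, sequence length, architecture and the dummy data are all
illustrative assumptions.

# Minimal sketch: binary sequence classifier over token-id streams
# (patient is / is not a true case). Hyperparameters are placeholders.
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense

VOCAB_SIZE = 20000   # placeholder vocabulary size
MAX_LEN = 500        # placeholder: tokens per patient record, padded/truncated

model = Sequential([
    Embedding(input_dim=VOCAB_SIZE, output_dim=64),
    LSTM(64),
    Dense(1, activation="sigmoid"),   # probability of a true positive case
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

# x: integer-encoded token streams, y: 0/1 labels from the annotated data.
x = np.random.randint(1, VOCAB_SIZE, size=(32, MAX_LEN))   # dummy stand-in data
y = np.random.randint(0, 2, size=(32,))
model.fit(x, y, epochs=1, batch_size=8)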
Ideally, there would be an active learning element where an
unsupervised model was tuned to the needs of a researcher to enable
the machine to learn to classify from just a few examples.
Dai and Le's work does not have an active learning component.
The baseline to compare against is the vets' current approach
(looking for all occurrences of the disease term), which generates
large numbers of false positives. It is a weak baseline.
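For reference, this baseline amounts to little more than the following keyword
match; the term list is a placeholder.

# Weak baseline sketch: flag any document containing a disease term variant.
import re

DISEASE_TERMS = ["diabetes", "dm"]   # placeholder term variants
pattern = re.compile(r"\b(" + "|".join(map(re.escape, DISEASE_TERMS)) + r")\b",
                     re.IGNORECASE)

def baseline_flag(document_text):
    # True whenever any disease term appears, regardless of context
    # (negated, hypothetical or attributed mentions all count as hits).
    return bool(pattern.search(document_text))

print(baseline_flag("Owner worried about dm; diabetes was ruled out."))  # True (a false positive)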
Annotated data for this problem is available, in the order of tens of
thousands of documents.
Literature
- Dai, Andrew M., and Quoc V. Le (2015). Semi-Supervised Sequence
  Learning. arXiv:1511.01432 [cs], November.
  http://arxiv.org/abs/1511.01432.