ACS student project suggestions 2015/16 -- Simone Teufel
Project 1: Sentence Alignment for Summarisation
Project Description
This project provides an important prerequisite for supervised machine learning
methods, namely an alignment of sentences from the abstract with
sentences from the document body of scientific articles.
One form of state-of-the-art summarisation relies on supervised
machine learning of lexical statistics from pairs of sentences from
abstracts and sentences from documents. This project develops a method
of determining the best alignment of such sentence pairs, choosing
from a large set of models of semantic similarity.
The project combines distributional models of sentence similarity with
discourse-derived information to find the best overlap of sentences.
Marcu (1999) proposes a clause-based method of determining where in a
text an abstract clause comes from, based on cosine similarity.
The project will reimplement the Marcu model, but test replacing the cosine
similarity with the Longest Common Substring and with various
distributional models of clause similarity.
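As a point of orientation, here is a minimal sketch of the cosine-similarity
starting point, assuming abstract and document sentences are already available
as lists of strings; the TF-IDF weighting and the greedy one-best mapping are
illustrative assumptions, not the project's required design.

# Minimal sketch: align abstract sentences to document sentences by
# TF-IDF cosine similarity (a stand-in for the cosine model in Marcu 1999).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def align(abstract_sentences, document_sentences):
    vectoriser = TfidfVectorizer(lowercase=True, stop_words="english")
    # Fit on all sentences so both sides share one vocabulary.
    matrix = vectoriser.fit_transform(abstract_sentences + document_sentences)
    abs_vecs = matrix[:len(abstract_sentences)]
    doc_vecs = matrix[len(abstract_sentences):]
    sims = cosine_similarity(abs_vecs, doc_vecs)
    # Greedy one-best alignment: each abstract sentence maps to the
    # document sentence with the highest similarity score.
    return [(i, int(sims[i].argmax()), float(sims[i].max()))
            for i in range(len(abstract_sentences))]

abstract = ["Dung beetles recycle cattle dung quickly."]
document = ["Some beetles spend their lives eating and breeding in dung.",
            "The beetles bury cow pats within days."]
print(align(abstract, document))

Replacing the cosine_similarity call with Longest Common Substring or a
distributional similarity measure is then a local change to this alignment step.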
For 80 papers, abstract sentences have been manually aligned with
sentences in the document body (but not at clause level). This corpus
can be used for evaluation.
Literature
- Daniel Marcu (1999). The automatic construction of large-scale corpora
  for summarization research. In Proceedings of the 22nd International ACM
  SIGIR Conference on Research and Development in Information Retrieval
  (SIGIR'99), pages 137-144, Berkeley, CA, August 1999.
Project 2: Automatic Model of Scientific Argumentation
Project Description
Scientific argumentation can be seen as a sequence of speech acts
operating in an argumentation game, where the highest-level goal is
the justification of the current paper. Intermediate goals are the
sub-argument that the work presented is novel, or that the work
presented constitutes an improvement over existing work. Successful
recognition of these speech acts allows the higher-level intentions,
and eventually the overall goal, to be recovered.
We are particularly interested in one kind of speech act here: those
involving mentions of other work -- citations, author names, names of
approaches associated with particular authors, possessive and personal
pronouns, and statements about the relationship between the mentioned
work and the current paper. For each such noun phrase, we need the
following factors:
- "grounding" in terms of a link to one or more of the citations
listed at the end of the paper
- a probability that expresses the certainty that the noun phrase
encountered in a given sentence actually does refer to that citation.
- a classification as one of the 23 listed speech acts from
Teufel (2010) in which the noun phrase participates
- a probability that expresses the certainty that the speech act
was indeed found in the paper.
Evaluation will compare sentences that have been pre-annotated (known
to contain moves involving existing work) with the system's first
choice for each noun phrase. 200,000 sentences pre-annotated at the
sentence level exist, but the student may have to do some annotation
him/herself.
The student choosing this project will have to make their own choices
with respect to the machine learning algorithm, and creatively develop
an algorithm for a new task definition which is located between named
entity recognition, citation classification and coreference resolution.
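As a rough illustration of one possible starting point (not a prescribed
design), the sketch below classifies candidate noun phrases with a few shallow
surface features and a logistic regression model whose predicted probabilities
could serve as the certainty scores described above; the feature set and the
class labels are placeholders, not the actual 23 speech acts from Teufel (2010).

# Minimal sketch: classify noun-phrase mentions of other work and return
# a probability for each class. Features and labels are illustrative only.
import re
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

def np_features(noun_phrase, sentence):
    return {
        "head": noun_phrase.split()[-1].lower(),
        "has_citation": bool(re.search(r"\(\s*\w+[^)]*\d{4}\s*\)", noun_phrase)),
        "is_pronoun": noun_phrase.lower() in {"they", "their", "we", "our", "it"},
        "sent_has_contrast": any(w in sentence.lower()
                                 for w in ("however", "unlike", "in contrast")),
    }

# Toy training data (placeholder): (noun phrase, containing sentence, label).
train = [
    ("Marcu (1999)", "Marcu (1999) proposes a clause-based method.", "NEUTRAL_MENTION"),
    ("their approach", "Unlike their approach, we use full parsing.", "CONTRAST"),
]
X = [np_features(np_, sent) for np_, sent, _ in train]
y = [label for _, _, label in train]

model = make_pipeline(DictVectorizer(), LogisticRegression(max_iter=1000))
model.fit(X, y)

probs = model.predict_proba(
    [np_features("the Marcu model", "We reimplement the Marcu model.")])[0]
print(dict(zip(model.classes_, probs)))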
Literature
- Siddharthan and Teufel (2007). Whose idea was this, and why does it
  matter? Attributing scientific work to citations. In Proceedings of the
  Annual Conference of the North American Chapter of the Association for
  Computational Linguistics (NAACL-HLT 2007), pages 316-323, Rochester,
  New York, USA, 22-27 April.
Project 3: Automatic Identification of Creativity and
Innovativeness in Scientific Writing
Project Description
This project proposes the development of an indicator of innovativeness, in
order to improve bibliometric assessment of science. Bibliometrics is
the quantitative assessment of the research output of researchers or
universities, as used for instance in the UK's Research Excellence
Framework [1]. A related task is IARPA's FUSE program [2], which seeks
to detect emerging opportunities in science and technology as early as
possible. Its fundamental hypothesis is that real-world processes of
technical emergence leave discernible traces in the public scientific,
technical and patent literature. Most of the current science
indicators are citation-based. The degree of innovativeness of a
paper is an aspect of emergence that is closely related to this idea.
It is commonly believed that high impact papers are
innovative. However, some highly cited papers are conforming: they
document incremental research and tend to reinforce the status quo [3].
Innovativeness can therefore not be assessed purely by looking at
citation counts. One can try to approach the problem of identifying
innovative scientific papers using citation networks [3, 4]. This
approach is based on the idea that innovative papers maximally disrupt
the existing citation structure of the topic.
It has also long been assumed that access to full text would result in
better innovation finding. This is exemplified by the related problem of
identifying "paradigm shifts" [5]. The current project follows along
this research avenue, and attempts to add information about sentences
such as the following to the search for innovativeness:
  This result challenges the claims of recent discourse theories
  (Grosz and Sidner 1986; Reichman 1985) which argue for a close
  relation between cue words and discourse structure.
Our US collaborators Richard Klavans and Kevin Boyack have
completed a survey in which highly influential biomedical scientists
rated 10 of their highly cited papers. Despite the deeply subjective
nature of innovativeness, the authors themselves are certainly in the
best position to assess how innovative their own papers are, provided the
self-elicitation is performed in an honest, trusted manner, where the
reputation of the informant is not threatened. Klavans and Boyack
achieved this by asking only about those papers which are
high-impact anyway. The data from this survey allows us to classify
the 1200 papers as being innovative, progressive, or mediocre.
The methodology followed in this project relies on performing
Argumentative Zoning (AZ) classification on the full-text corpus of the
1200 papers, and then finding a correlation between the
rhetorical "footprint" of a paper (derived via AZ) and its level of
innovativeness. The rhetorical footprint will be based on AZ-derived
features, which are fed into a machine learning system that correlates
these features with the papers' innovativeness status.
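To illustrate what the footprint-to-label step could look like, the following
is a minimal sketch which assumes each paper's sentences have already been
labelled with AZ categories; the choice of zone proportions as features follows
the description above, while the zone inventory, data and classifier are
illustrative placeholders.

# Minimal sketch: turn per-sentence AZ labels into a "rhetorical footprint"
# (proportion of each zone) and fit a classifier against the survey labels
# (innovative / progressive / mediocre). Data here are dummy placeholders.
from collections import Counter
from sklearn.linear_model import LogisticRegression

ZONES = ["AIM", "OWN", "BACKGROUND", "CONTRAST", "BASIS", "OTHER", "TEXTUAL"]

def footprint(sentence_zones):
    # Proportion of sentences falling into each AZ category.
    counts = Counter(sentence_zones)
    total = max(len(sentence_zones), 1)
    return [counts[z] / total for z in ZONES]

# Placeholder training data: (per-sentence zone labels, survey label).
papers = [
    (["AIM", "OWN", "CONTRAST", "OWN"], "innovative"),
    (["BACKGROUND", "OWN", "OWN", "BASIS"], "progressive"),
    (["BACKGROUND", "BACKGROUND", "OWN", "OWN"], "mediocre"),
]
X = [footprint(zones) for zones, _ in papers]
y = [label for _, label in papers]

clf = LogisticRegression(max_iter=1000).fit(X, y)
print(clf.predict([footprint(["AIM", "CONTRAST", "OWN"])]))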
Practicalities
Most of the corpus is already acquired in full text and has been
transformed into a uniform XML format (SciXML). The first step of this
project is to unify the rest of the corpus into SciXML. The existing
implementation of Argumentative Zoning ([6,7] cf. other project
descriptions) can then be run on the new medical corpus.
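For orientation, a minimal sketch of reading sentences out of a SciXML file is
given below; the sentence element name used here (S) is an assumption based on
common SciXML conventions and should be checked against the actual corpus
schema.

# Minimal sketch: pull sentence strings out of a SciXML document.
# The element name "S" is an assumption to be verified against the corpus.
import xml.etree.ElementTree as ET

def scixml_sentences(path):
    tree = ET.parse(path)
    root = tree.getroot()
    # Collect the text of every sentence element, wherever it occurs.
    return ["".join(s.itertext()).strip() for s in root.iter("S")]

for sentence in scixml_sentences("paper.xml"):
    print(sentence)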
AZ currently relies on supervised machine learning. It has been
trained on annotated articles from computational linguistics and
chemistry [7]. However, the corpus we will use for learning
innovativeness contains articles from biomedical science. An early stage
of this project will therefore assess whether the classification of
the AZ system trained on chemistry and CL is adequate for helping in
the innovativeness classification, or whether its lexical resources
need to be manually adapted.
Literature
- [1] Research Excellence Framework (REF).
  http://www.ref.ac.uk/background/bibliometrics/.
- [2] D. A. Murdick, Foresight and understanding from scientific
  exposition (FUSE). http://www.iarpa.gov/Programs/ia/FUSE/fuse.html.
- [3] R. Klavans, K. W. Boyack, A. A. Sorensen, and C. Chen, Towards
  the development of an indicator of conformity.
- [4] C. Chen, Y. Chen, M. Horowitz, H. Hou, Z. Liu, and
  D. Pellegrino, Towards an explanatory and computational theory of
  scientific discovery, Journal of Informetrics, vol. 3, no. 3,
  pp. 191-209, 2009.
- [5] F. Lisacek, C. Chichester, A. Kaplan, and A. Sandor,
  Discovering paradigm shift patterns in biomedical abstracts:
  application to neurodegenerative diseases, in: First International
  Symposium on Semantic Mining in Biomedicine, pp. 11-13, Citeseer,
  2005.
- [6] S. Teufel and M. Moens, Summarizing scientific articles:
  experiments with relevance and rhetorical status, Computational
  Linguistics, vol. 28, no. 4, pp. 409-445, 2002.
- [7] S. Teufel, A. Siddharthan, and C. Batchelor, Towards
  discipline-independent argumentative zoning: Evidence from chemistry
  and computational linguistics, in: Proceedings of the 2009 Conference
  on Empirical Methods in Natural Language Processing: Volume 3,
  pp. 1493-1502, Association for Computational Linguistics, 2009.
Project 4: Improving the Output of a Proposition-based Summariser
Project Originator
Simone Teufel
Project Supervisor
Simone Teufel and Yimai Fang
Project Description
This project aims to create more grammatical output for an existing
prototype summariser which is based on propositions -- shallow semantic
representations which the summariser uses to build a simple discourse
model. On the basis of this model, it can decide which propositions are
the text's most important ones.
And this is where the trouble begins. What this project addresses
is what happens once these propositions have been chosen, not the main
mechanism of the summariser. The current, clearly suboptimal, output
solution is to print the lexical part of the selected propositions in
text order. This produces texts such as:
Some of the beetles, which spend their lives eating and breeding in
dung. 4,000 species have evolved to climates the dung of
animals. The soft cattle dung in which flies flies
breed. Dung-breeding flies. A time, into cow pats.
This text was produced by simple extraction from propositions such as
(ranked by importance):
- 6(1.13) spend (which; lives, eating)
- 12(1.21) in (eating; dung)
- 83(7.24) into (a time; cow pats)
- 11(1.19) and (eating, breeding)
- 9(1.15+) POSSESS (their; lives)
- 21(2.22) of (the dung; animals)
- 15(2.10) have evolved (4,000 species; to: climates, to: the dung)
- 1(1.2) of (Some; the beetles)
The summariser is very good at handling propositions and beats other
summarisers on its content extraction ability, but for human
consumption the output is too rough.
One might ask why we choose to produce output at the proposition
level, if it results in texts of such low quality, rather
than extracting sentences, which is obviously much easier. The reason is
that the proposition-level unit gives our summariser an edge over
sentence extractors, which cannot bundle information as tightly as
this summariser can. However, there are many ways in which this can be
done better than by lexeme extraction. The task of the student who
chooses this project is to make the output smoother.
The overall structure of this program will be an
overgenerate-and-rank model.
This project will use a subcategorisation lexicon and n-gram
methods of shallow generation to produce sentences from the
propositions which:
- are grammatical (or at least more grammatical than the current
solution); for this we will use a subcategorisation lexicon (Korhonen et
al. 2006)
- read naturally without distorting the meaning of the original
text; variations of shallow generation can be used for this, e.g.,
n-gram models (Langkilde and Knight 2002) or a knapsack generation
algorithm (Nishikawa et al. 2014); a rough sketch of the n-gram
ranking idea is given below.
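The following is a minimal sketch of the overgenerate-and-rank idea under very
strong simplifications: candidate realisations are just permutations of a
proposition's lexical chunks, and the ranker is a tiny add-one-smoothed bigram
model estimated from a placeholder reference text. Both choices are
illustrative assumptions, not the project's intended solution.

# Minimal sketch of overgenerate-and-rank: enumerate candidate orderings of
# a proposition's lexical chunks and rank them with a toy bigram model.
from collections import Counter
from itertools import permutations
from math import log

reference = "some of the beetles spend their lives eating and breeding in dung".split()
unigrams = Counter(reference)
bigrams = Counter(zip(reference, reference[1:]))

def score(tokens):
    # Add-one smoothed bigram log-probability of a candidate realisation.
    total = 0.0
    for prev, curr in zip(tokens, tokens[1:]):
        total += log((bigrams[(prev, curr)] + 1) / (unigrams[prev] + len(unigrams)))
    return total

def realise(chunks):
    # Overgenerate: all orderings of the chunks; rank: best bigram score.
    candidates = [" ".join(p) for p in permutations(chunks)]
    return max(candidates, key=lambda c: score(c.split()))

print(realise(["the beetles", "spend", "their lives", "eating and breeding in dung"]))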
Literature
- Y. Fang and S. Teufel (2014). A summariser based on human memory
  limitations and lexical competition. In Proceedings of EACL 2014,
  Gothenburg, Sweden.
- Anna Korhonen, Yuval Krymolowski and Ted Briscoe (2006). A Large
  Subcategorization Lexicon for Natural Language Processing
  Applications. In Proceedings of the 5th International Conference on
  Language Resources and Evaluation (LREC 2006), Genova, Italy.
- H. Nishikawa, K. Arita, K. Tanaka, T. Hirao, T. Makino and Y. Matsuo
  (2014). Learning to Generate Coherent Summary with Discriminative
  Hidden Semi-Markov Model. In Proceedings of the 25th International
  Conference on Computational Linguistics (COLING 2014).
Project 5: Is Fido really sick? Sequence learning applied to Disease
Indicators in a veterinary context
Project Originator
Noel Kennedy
Project Supervisor
Simone Teufel and Noel Kennedy
Project Description
This project will improve an information retrieval system that is
currently in use at the Royal Veterinary College in London.
The IR system indexes the clinical data in
the VetCompass project.
VetCompass is a not-for-profit organisation which seeks to improve
animal welfare by improving the understanding of animal
diseases. VetCompass holds clinical records for 4 million animals and the IR
system indexes around 130 million documents. This project addresses a key
problem in clinical research (including human clinical research). In
contrast to human medical data, access to veterinary data is orders of
magnitude easier and cheaper.
The particular problem addressed is the following: a vet is searching
for cases of patients that have a certain disease, and enters
variations of the disease name, e.g. 'diabetes' or 'dm'. If a case note
(document) contains a mention of a particular disease, there is only a
33% chance that the patient actually has that disease, i.e., the False
Positive (FP) rate is 67%. In those cases, the
disease appears in a negated, hypothetical or attributed
context. The True Positive (TP) rate varies among different diseases
and ranges from 12% to 63%. This project will collect evidence about
disease references, in particular the sequence in which related
evidence occurs, in order to improve this situation and lower the FP rate.
The vets only look at documents which contain at least one
disease-relevant token. They use domain knowledge to interpret at
least one whole document per patient. They are looking for enough
cumulative evidence from multiple sentences, which they use to
determine if the patient meets their criteria or not. The researcher
makes the classification decision, for each patient, at potentially
two different points:
- A positive decision is made after reading enough relevant
documents with enough positive evidence that warrants classifying the
patient as a positive case. This point is typically reached part-way
through all the relevant documents for each patient.
- A negative decision can only be reached after the last relevant
document was read and there wasn't enough evidence to make a
positive decision. They typically have to read all the relevant
documents to make a negative decision. For this reason, a high false
positive rate is a big problem.
This project attacks the problem as a sequence classification
task. The sequence is a stream of tokens (probably with sentence and
document termination tokens). The machine would need to learn to
differentiate positive and negative sequences. Dai and Le (2015)
trained a sequence autoencoder for a similar task, used it to
initialise a sequence classifier, and found that this pre-training
improved the performance of the classifier.
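To make the task definition concrete, here is a minimal sketch of a
token-stream sequence classifier in the spirit of the Dai and Le setup, but
without the autoencoder pre-training or any active learning component; the
vocabulary size, sequence length, architecture and the dummy data are all
illustrative assumptions.

# Minimal sketch: binary sequence classifier over token-id streams
# (patient is / is not a true case). Hyperparameters are placeholders.
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense

VOCAB_SIZE = 20000   # placeholder vocabulary size
MAX_LEN = 500        # placeholder: tokens per patient record, padded/truncated

model = Sequential([
    Embedding(input_dim=VOCAB_SIZE, output_dim=64),
    LSTM(64),
    Dense(1, activation="sigmoid"),   # probability of a true positive case
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

# x: integer-encoded token streams, y: 0/1 labels from the annotated data.
x = np.random.randint(1, VOCAB_SIZE, size=(32, MAX_LEN))   # dummy stand-in data
y = np.random.randint(0, 2, size=(32,))
model.fit(x, y, epochs=1, batch_size=8)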
Ideally, there would be an active learning element where an
unsupervised model was tuned to the needs of a researcher to enable
the machine to learn to classify from just a few examples.
Dai and Le's work does not have an active learning component.
The baseline to compare against is the vets' current approach
(looking for all occurrences of the disease term), which generates
large numbers of false positives. It is a weak baseline.
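For reference, this baseline amounts to little more than the following keyword
match; the term list is a placeholder.

# Weak baseline sketch: flag any document containing a disease term variant.
import re

DISEASE_TERMS = ["diabetes", "dm"]   # placeholder term variants
pattern = re.compile(r"\b(" + "|".join(map(re.escape, DISEASE_TERMS)) + r")\b",
                     re.IGNORECASE)

def baseline_flag(document_text):
    # True whenever any disease term appears, regardless of context
    # (negated, hypothetical or attributed mentions all count as hits).
    return bool(pattern.search(document_text))

print(baseline_flag("Owner worried about dm; diabetes was ruled out."))  # True (a false positive)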
Annotated data for this problem is available, in the order of tens of
thousands of documents.
Literature
- Dai, Andrew M., and Quoc V. Le (2015). Semi-Supervised Sequence
  Learning. arXiv:1511.01432 [cs], November.
  http://arxiv.org/abs/1511.01432.