ACS student project suggestions 2014/15 -- Simone Teufel
Project 1: Automatic Induction of a Scientific Sentiment Lexicon
Project Originator
Simone Teufel
Supervisor
Simone Teufel
Project Description
Sentiment detection in scientific discourse is a different concept from
sentiment detection in, for instance, movie or product reviews, where
an artefact is directly evaluated. Negative sentiment in science,
however, typically corresponds to a problematic situation or a
problem-solving activity that fails. What exactly establishes the
problem can take many forms and is hard to recognise automatically,
because simple features such as "good", "bad" or "unable" are
rare. However, there are two aspects that can help: mutual constraints
amongst all possible sentiment lexicon candidates observed in the
text, and indicators from the discourse context.
Lu et al. (2011) describe an approach that utilises the first
intuition. They use an integer programming (optimization)
approach to semantic lexicon construction for product reviews, which
uses information about antonymy and synonymy from a thesaurus such as
WordNet, negation, and other heuristics such as coordination of sentiment
phrases to induce a sentiment lexicon.
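Lu et al.'s full formulation is an integer program; as a rough illustration of the underlying intuition, the mutual constraints can be approximated by propagating polarity scores over synonym edges (same polarity) and antonym edges (opposite polarity). All word lists and seed polarities below are invented for illustration.

```python
# Lu et al. (2011) cast lexicon construction as integer programming;
# this toy version instead iteratively propagates real-valued polarity
# scores over synonym edges (same sign) and antonym edges (opposite
# sign). Words and seed polarities are invented for illustration.

def induce_lexicon(seeds, synonyms, antonyms, iterations=10):
    """Propagate polarity from seed words under soft constraints."""
    vocab = set(seeds)
    for a, b in synonyms + antonyms:
        vocab.update([a, b])
    scores = {w: seeds.get(w, 0.0) for w in vocab}
    for _ in range(iterations):
        new = dict(scores)
        for w in vocab:
            if w in seeds:            # seed polarities stay clamped
                continue
            total, n = 0.0, 0
            for a, b in synonyms:     # synonyms pull towards same sign
                if w == a:
                    total += scores[b]
                    n += 1
                elif w == b:
                    total += scores[a]
                    n += 1
            for a, b in antonyms:     # antonyms pull towards opposite sign
                if w == a:
                    total -= scores[b]
                    n += 1
                elif w == b:
                    total -= scores[a]
                    n += 1
            if n:
                new[w] = total / n
        scores = new
    return scores

scores = induce_lexicon(
    seeds={"good": 1.0, "bad": -1.0},
    synonyms=[("good", "effective"), ("bad", "problematic")],
    antonyms=[("effective", "flawed")],
)
```

In the real formulation, WordNet relations, negation and coordination all contribute constraints to a single global optimisation rather than being applied iteratively.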
The second helping hand is not available in product reviews, but is
available in scientific writing: discourse analysis à la Argumentative
Zoning. Argumentative Zoning is a method of discourse analysis that
uses supervised or unsupervised ML to detect stages of argumentation
in scientific text. Analysis takes the form of sentence-based
classification. Features include lexical indicators, sequence
information, location, citation and verb-syntactic features. Expressions
that are generally bad or good can also be derived from patterns such
as "too X", "not enough X" or "right amount of X".
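A minimal sketch of such pattern-based indicators; the pattern inventory and the example sentence are illustrative, not the actual AZ feature set:

```python
import re

# Minimal sketch of the pattern-based indicators mentioned above
# ("too X", "not enough X", "right amount of X"). The pattern list and
# example sentence are illustrative, not the actual AZ feature set.

PATTERNS = [
    (re.compile(r"\btoo (\w+)"), "NEG"),
    (re.compile(r"\bnot enough (\w+)"), "NEG"),
    (re.compile(r"\bunable to (\w+)"), "NEG"),
    (re.compile(r"\bright amount of (\w+)"), "POS"),
]

def indicator_features(sentence):
    """Return (polarity, trigger word) pairs found in a sentence."""
    hits = []
    for pattern, polarity in PATTERNS:
        for match in pattern.finditer(sentence.lower()):
            hits.append((polarity, match.group(1)))
    return hits

feats = indicator_features(
    "The model is too slow and there is not enough data.")
# → [("NEG", "slow"), ("NEG", "data")]
```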
The student will reimplement Lu et al.'s approach and use AZ status as
an additional information source. Time permitting, the lexicon induced
this way can be further improved by bootstrapping. In bootstrapping,
preliminary AZ analysis is performed to induce the first sentiment
lexicon, which in turn leads to a better AZ classification and so on.
This will lead to an improvement of AZ analysis within one domain, and
possibly support the porting of AZ to other domains.
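The bootstrapping loop can be sketched as follows; the coordination heuristic ("X and Y" share polarity), the toy corpus and the trivial AZ step are placeholders for the real classifier and induction components:

```python
# Hedged sketch of the bootstrapping loop described above. The
# coordination heuristic ("X and Y" inherit each other's polarity) and
# the toy corpus stand in for the real AZ and induction steps.

def bootstrap(documents, lexicon, rounds=3):
    """Alternate a crude AZ step with a crude lexicon-growth step."""
    for _ in range(rounds):
        # AZ step: sentences containing a known negative word are
        # treated as problem statements
        problem_sents = [s for doc in documents for s in doc
                         if any(w in lexicon for w in s.split())]
        # induction step: words coordinated with a known negative
        # word ("X and Y") inherit its polarity
        for s in problem_sents:
            words = s.split()
            for i, w in enumerate(words):
                if w == "and" and 0 < i < len(words) - 1:
                    if words[i - 1] in lexicon:
                        lexicon = lexicon | {words[i + 1]}
                    if words[i + 1] in lexicon:
                        lexicon = lexicon | {words[i - 1]}
    return lexicon

lex = bootstrap(
    [["the method is slow and brittle",
      "results were brittle and noisy"]],
    {"slow"},
)
```

Note how the second round reaches "noisy" only because the first round added "brittle", which is the essence of the bootstrap.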
An existing AZ system, which includes a simple sentiment lexicon,
provides the starting point. The AZ system also guarantees a simple
way of performing extrinsic evaluation. Intrinsic evaluation of the
quality of the sentiment lexicon is therefore not necessarily required
in this project, although validation of the sentiment lexicon by a
human experiment would constitute an even better evaluation of the
project outcome.
References
- Lu, Castellanos, Dayal and Zhai (2011). Automatic construction of a
context-aware sentiment lexicon: An optimization approach. In
Proceedings of the 20th International Conference on World Wide Web
(WWW '11), pp. 347-356.
- Teufel and Moens (2002). Summarizing scientific articles:
Experiments with relevance and rhetorical status. Computational
Linguistics, 28(4), 409-445.
- Agichtein and Gravano (2000). Snowball: Extracting relations from
large plain-text collections. In Proceedings of the Fifth ACM
Conference on Digital Libraries (DL '00), pp. 85-94.
Project 2: Domain Adaptation for Argumentative Zoning
Project Originator
Diarmuid O'Seaghdha
Supervisor
Diarmuid O'Seaghdha and Simone Teufel
Project Description
In scientific writing, each part of a text has a specific role to play
in building the narrative the writer is trying to communicate. For
example, different parts may introduce the topic and motivate its
importance, describe previous work and its shortcomings, describe the
authors' own contribution or present a conclusion. Argumentative
zoning (AZ; Teufel, 2010) is the task of detecting the parts of
scientific articles that perform specific rhetorical functions. It has
been shown that AZ annotation can improve scientific summarisation and
can speed up literature browsing by domain experts.
Automatic AZ annotation is generally treated as a supervised learning
task: a statistical classifier is trained on a set of manually
annotated texts and learns to predict zones for unseen texts. The
features used by the model are a combination of lexical items and
information about the text structure. When the unseen data is similar
to the training data, AZ classifiers can perform relatively
well. However, linguistic and structural conventions can vary greatly
even within subfields of the same scientific discipline (e.g., NLP vs
theoretical CS). A system trained on annotated NLP papers will have
trouble labelling a paper from Nature Genetics. This is a particular
instance of a general issue that arises in many areas of NLP (and
across machine learning), from parsing to sentiment analysis. "Domain
adaptation" is the name given to the task of adapting a statistical
classifier to data which is different from the data it was trained on.
Many approaches to domain adaptation have been proposed in the
literature, and the goal of this project will be to investigate
whether some of these approaches can successfully be applied to
argumentative zoning. One simple approach involves augmenting the
feature space with in-domain and cross-domain copies of each feature
(Daumé III, 2007); another approach uses lexical representations
learned from large corpora using clustering (Koo et al, 2008), topic
models (Guo et al, 2009; Eidelman et al, 2012) or neural networks
(Glorot et al, 2011); yet another generates training data from system
predictions through iterative self-training (McClosky and Charniak,
2008). The experimental setup will use pre-existing AZ datasets from
two scientific genres (computer science and chemistry) for training
and evaluation.
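Daumé III's feature augmentation, for instance, amounts to a one-line mapping of each feature into a shared copy and a domain-specific copy; a minimal sketch (the feature names are invented):

```python
# A minimal sketch of Daumé III's (2007) "frustratingly easy" domain
# adaptation: each feature gets a shared copy and a domain-specific
# copy, letting the learner decide per feature whether its weight
# transfers across domains. The feature names are invented.

def augment(features, domain):
    """Map a feature dict into the augmented (shared + domain) space."""
    out = {}
    for name, value in features.items():
        out["shared:" + name] = value      # version active in all domains
        out[domain + ":" + name] = value   # version active only in-domain
    return out

x = augment({"word=however": 1.0, "position=intro": 1.0}, domain="chem")
```

At training time each instance is mapped with its own domain label; the classifier then learns large shared weights for features that behave the same way in every domain and large domain-specific weights for the rest.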
References
- Hal Daumé III. 2007. Frustratingly easy domain adaptation. In Proceedings of ACL-07. Prague, Czech Republic.
- Vladimir Eidelman, Jordan Boyd-Graber and Philip Resnik. 2012. Topic models for dynamic translation model adaptation. In Proceedings of ACL-12. Jeju, Korea.
- Xavier Glorot, Antoine Bordes and Yoshua Bengio. 2011. Domain adaptation for large-scale sentiment classification: A deep learning approach. In Proceedings of ICML-11. Bellevue, WA.
- Terry Koo, Xavier Carreras and Michael Collins. 2008. Simple Semi-supervised Dependency Parsing. In Proceedings of ACL-08. Columbus, OH.
- Honglei Guo, Huijia Zhu, Zhili Guo, Xiaoxun Zhang, Xian Wu and Zhong Su. 2009. Domain adaptation with latent semantic association for named entity recognition. In Proceedings of ACL-09. Suntec, Singapore.
- David McClosky and Eugene Charniak. 2008. Self-training for biomedical parsing. In Proceedings of ACL-08. Columbus, OH.
- Simone Teufel. 2010. The Structure of Scientific Articles:
Applications to Citation Indexing and Summarization. CSLI
Publications, Stanford, CA.
Project 3: Identifying Deixis to Communicative Artifacts in Text
Project Originator
Simone Teufel and Shomir Wilson
Project Supervisor
Simone Teufel and Shomir Wilson
Project Description
This project looks at an aspect of discourse (anaphora and
coreference), which currently often causes problems for the practical
processing of scientific texts, namely deixis.
Texts of scholarly papers frequently mention items such as
illustrations, typesetting elements (e.g., sections and lists), and
discourse entities. Some mentions contain identifiers for their
referents (such as "in Figure 1" or "in Section 3") but many other
mentions connect with their referents less explicitly, using deictic
phrases like "the figure above", "this section", or "those ideas".
Deictic mentions make a connection between information represented in
non-linguistic forms and the meaning of text, but they rely on the
reader to select the proper referent for the connection.
Preliminary work has shown the richness and diversity of deictic
mentions in textbooks (Wilson and Oberlander 2014). We hypothesize that
scholarly papers are similarly fertile, and that deictic mentions have a
substantial role in the rhetorical structure of a paper. Separately,
analysis of the rhetorical structure of scientific research articles has
shown how certain "landmark sentences" (often containing deictic
mentions) serve as signposts for the rhetorical functions of passages.
However, the role of deictic mentions in rhetorical structure has not
been explored.
The student will start with an existing corpus of scientific documents
annotated with rhetorical markers. The student will use a dependency
parser to search the corpus for simple patterns in sentence structure
that indicate occurrences of deictic mentions. They will use the results
of this search to answer one or more of the following questions:
- What is the relationship between sentences that signpost rhetorical
structure and deictic mentions? Do the two correlate closely, or are
they separate phenomena that happen to overlap?
- How often are the referents of deictic mentions subject to
vagueness? In particular, how often is it impossible to precisely
delimit the part of a document that a deictic mention refers to? Is
this vagueness useful, i.e., does it help the writer communicate
something that precision would not?
- Which methods can we use to automatically disambiguate deictic
mentions from signposting mentions? There is a range of supervised
and unsupervised approaches available, e.g. Kim and Webber (2006).
- When deictic mentions occur as part of rhetorical signposting,
how difficult is it to automatically determine the type of the
referent (whether it is a sentence, a section of the paper, or
something else)? Are simple vocabulary-based heuristics sufficient,
or is some deeper understanding of semantics or pragmatics
necessary? This task is a typical coreference task (e.g. CoNLL 2012),
and also touches on aspects of information structure (Roesiger and
Teufel, 2014).
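As a rough illustration of the detection step, deictic mentions can be separated from explicit identifier mentions with surface patterns; the project itself proposes dependency-parse patterns, and the phrase inventory below is invented:

```python
import re

# Illustrative separation of deictic mentions ("the figure above",
# "this section") from explicit identifier mentions ("Section 3")
# using surface patterns. A real implementation would use
# dependency-parse patterns; this phrase inventory is invented.

EXPLICIT = re.compile(r"\b(Figure|Table|Section)\s+\d+\b")
DEICTIC = re.compile(
    r"\b(this|that|these|those|the)\s+"
    r"(figure|table|section|list|example|ideas?)\b"
    r"(\s+(above|below))?",
    re.IGNORECASE)

def classify_mentions(sentence):
    """Split a sentence's mentions into explicit and deictic ones."""
    explicit = [m.group(0) for m in EXPLICIT.finditer(sentence)]
    deictic = [m.group(0) for m in DEICTIC.finditer(sentence)
               if m.group(0) not in explicit]
    return {"explicit": explicit, "deictic": deictic}

mentions = classify_mentions(
    "As the figure above shows, this section extends Section 3.")
```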
References
- Wilson and Oberlander (2014). Determiner-established deixis to
communicative artifacts in pedagogical text. In Proceedings of the
52nd Annual Meeting of the Association for Computational
Linguistics, pp409--414. June 22-27, Baltimore, MD.
- Kim and Webber (2006). Automatic Reference Resolution in
Astronomy Articles. In Proceedings of the 20th CODATA Conference.
- I. Roesiger and S. Teufel. 2014. Resolving Coreferent and
Associative Noun Phrases in Scientific Text. In: Proceedings of
EACL-14, Gothenburg, Sweden.
Project 4: Towards an Automatic Model of Argumentation
Project Description
Scientific argumentation can be seen as a sequence of speech acts
operating in an argumentation game, where the highest-level goal is
the justification of the current paper. Intermediate goals are
sub-arguments, for instance that the work presented is novel, or that
it constitutes an improvement over existing work. Robust recognition
of these speech acts will lead to the higher-level intentions and
eventually to the overall goal.
The larger framework in which this project operates is a spreading
activation network serving as the model of argumentation: it encodes
the model's certainty that each speech act was performed, and the
connections between goals or argumentation steps.
The particular goal of this project is the detection of mentions of
other work -- citations, author names, names of approaches associated
with the authors, possessive and personal pronouns. A mention is a
noun phrase in any given sentence. For each noun phrase, we need the
following two factors:
- "grounding" in terms of a link to one
or more of the citations listed at the end of the paper
- a probability that expresses the certainty that the noun phrase
encountered in a given sentence actually does refer to that citation.
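A hypothetical sketch of how these two factors might be represented: each noun phrase is grounded to candidate reference-list entries together with a confidence. The surname/year-overlap scoring is an invented placeholder, not the proposed algorithm:

```python
# Hypothetical sketch of the two factors above: grounding a noun
# phrase to reference-list entries with an attached confidence. The
# surname/year-overlap scoring is an invented placeholder, not the
# proposed algorithm.

def ground(noun_phrase, references):
    """Return (reference id, confidence) candidates for one mention."""
    cleaned = noun_phrase.lower()
    for ch in "(),":
        cleaned = cleaned.replace(ch, " ")
    tokens = set(cleaned.split())
    candidates = []
    for ref_id, (authors, year) in references.items():
        overlap = len(tokens & {a.lower() for a in authors})
        score = overlap + (1 if str(year) in tokens else 0)
        if score:
            # rough normalisation into [0, 1]
            candidates.append((ref_id, score / (len(authors) + 1)))
    return sorted(candidates, key=lambda c: -c[1])

refs = {"R1": (["Grosz", "Sidner"], 1986), "R2": (["Reichman"], 1985)}
best = ground("the approach of Grosz and Sidner (1986)", refs)
# → [("R1", 1.0)]
```

Pronouns and approach names would need additional evidence (e.g. discourse context) beyond this string overlap.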
Evaluation will compare sentences pre-annotated as containing moves
involving existing work with the system's first choice for each noun
phrase. 200,000 sentences pre-annotated at the sentence level exist,
but the student may have to do some annotation themselves.
The student choosing this project will have to make their own choices
with respect to the machine learning algorithm, and creatively develop
an algorithm for a new task definition which is located between named
entity recognition, citation classification and coreference resolution.
References
- Siddharthan and Teufel (2007). Whose idea was this, and why does it
matter? Attributing scientific work to citations. In Proceedings of
the Annual Conference of the North American Chapter of the
Association for Computational Linguistics (NAACL-HLT 2007),
Rochester, New York, United States, 22-27 April, pp. 316-323.
Project 5: Automatic Identification of Creativity and
Innovativeness in Scientific Writing
Project Description
This project proposes the development of a computational-linguistic
indicator of innovativeness, in order to improve bibliometric
assessment of science. Bibliometrics is the science of assessing the
quality of the research output of researchers or universities, for
instance in the UK's Research Excellence Framework [1]. A related
task is IARPA's FUSE program [2], which seeks to detect emerging
opportunities in science and technology as early as possible. Its
fundamental hypothesis is that real-world processes of technical
emergence leave discernible traces in the public scientific,
technical and patent literature. Most of the current science
indicators are citation-based. The degree of innovativeness of a
paper is an aspect of emergence that is closely related to this idea.
It is commonly believed that high-impact papers are
innovative. However, some highly cited papers are conformist: they
document incremental research and tend to reinforce the status quo
[3]. Innovativeness can therefore not be assessed purely by looking
at citation counts. One can try to approach the problem of
identifying innovative scientific papers using citation networks
[3, 4]. This approach is based on the idea that such papers disrupt
the existing citation structure of the topic. It has also long been
assumed that access to full text would result in better innovation
finding. This is exemplified by the related problem of identifying
"paradigm shifts" [5]. The current project follows along
this research avenue, and attempts to add information about sentences
such as the following to the search for innovativeness:
This result challenges the claims of recent discourse theories
(Grosz and Sidner 1986; Reichman 1985) which argue for the close
relation between cue words and discourse structure.
Our US collaborators Richard Klavans and Kevin Boyack have
completed a survey where highly influential biomedical scientists
rated 10 of their highly cited papers. Despite the deeply subjective
nature of innovativeness, the authors themselves are certainly best
placed to assess how innovative their own papers are, provided the
self-elicitation is performed in an honest, trusted manner, where the
reputation of the informant is not threatened. Klavans and Boyack
achieved this by asking only about papers which are high-impact
anyway. The data from this survey now allows us to tag these 1,200
papers as innovative, progressive, or mediocre. The outcome of this
project will be a system
that is able to automatically assess the level of innovation of an
unseen paper.
The core of this project, after having made the AZ implementation run
on the biomedical corpus, is to find a correlation between the
rhetorical "footprint" of a paper (derived via AZ) and its level of
innovation. The "rhetorical footprint" will be based on AZ-based
features, which are fed into a machine learning system that correlates
these features to the papers' innovativeness status.
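A minimal sketch of such a footprint: the proportion of each AZ zone among a paper's sentences, used as a fixed-length feature vector. The zone labels follow Teufel's AZ scheme; the per-sentence labels in the example are invented:

```python
from collections import Counter

# Sketch of a "rhetorical footprint": the proportion of each AZ zone
# among a paper's sentences, usable as a fixed-length feature vector
# for an innovativeness classifier. Zone labels follow Teufel's AZ
# scheme; the per-sentence labels in the example are invented.

ZONES = ["AIM", "OWN", "CONTRAST", "BASIS", "BACKGROUND", "OTHER", "TEXTUAL"]

def footprint(zone_labels):
    """Map per-sentence zone labels to a normalised zone distribution."""
    counts = Counter(zone_labels)
    return [counts[z] / len(zone_labels) for z in ZONES]

vec = footprint(["AIM", "OWN", "OWN", "CONTRAST"])
# → [0.25, 0.5, 0.25, 0.0, 0.0, 0.0, 0.0]
```

One hypothesis worth testing is that innovative papers show, for instance, a higher proportion of CONTRAST zones; the classifier over these features would make that measurable.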
Practicalities
Most of the corpus has already been acquired in full text and has been
transformed into a uniform XML format (SciXML). The first step of this
project is to unify the rest of the corpus into SciXML. The existing
implementation of Argumentative Zoning ([6,7] cf. other project
descriptions) can then be run on the new medical corpus.
AZ currently relies on supervised machine learning. It has been
trained on annotated articles from computational linguistics and
chemistry [7]. However, the corpus we will use for learning about
innovativeness contains articles from biomedical science. An early stage
of this project will therefore assess whether the classification of
the AZ system trained on chemistry and CL is adequate for helping in
the innovativeness classification, or whether its lexical resources
need to be manually adapted.
References
- [1] Research Excellence Framework (REF). http://www.ref.ac.uk/background/bibliometrics/
- [2] D. A. Murdick, Foresight and understanding from
scientific exposition (FUSE).
http://www.iarpa.gov/Programs/ia/FUSE/fuse.html.
- [3] R. Klavans, K. W. Boyack, A. A. Sorensen, and C. Chen, Towards
the development of an indicator of conformity.
- [4] C. Chen, Y. Chen, M. Horowitz, H. Hou, Z. Liu, and
D. Pellegrino, Towards an explanatory and computational theory of
scientific discovery, Journal of Informetrics, vol. 3, no. 3,
pp. 191-209, 2009.
- [5] F. Lisacek, C. Chichester, A. Kaplan, and A. Sandor,
Discovering paradigm shift patterns in biomedical abstracts:
application to neurodegenerative diseases, In: first international
symposium on semantic mining in biomedicine, pp. 11-13, Citeseer,
2005.
- [6] S. Teufel and M. Moens, Summarizing scientific articles:
experiments with relevance and rhetorical status, Computational
linguistics, vol. 28, no. 4, pp. 409--445, 2002.
- [7] S. Teufel, A. Siddharthan, and C. Batchelor, Towards
discipline-independent argumentative zoning: Evidence from chemistry
and computational linguistics, in: Proceedings of the 2009 Conference
on Empirical Methods in Natural Language Processing: Volume
3, pp. 1493--1502, Association for Computational Linguistics, 2009
Last modified: Tue Oct 28 17:09:03 GMT 2014