ACS student project suggestions 2014/15 -- Simone Teufel

Project 1: Automatic Induction of a Scientific Sentiment Lexicon

Project Originator

Simone Teufel

Supervisor

Simone Teufel

Project Description

Sentiment detection in scientific discourse is different concept from sentiment detection in, for instance, movie or product reviews, where an artefact is directly evaluated. Negative sentiment in science, however, typically corresponds to a problematic situation or a problem-solving activity that fails. What exactly establishes the problem can take many aspects and is hard to recognise automatically, because simple features such as "good", "bad" or "unable" are rare. However, there are two aspects that can help: mutual constraints amongst all possible sentiment lexicon candidates observed in the text, and indicators from the discourse context.

Lu et al. (2011) describe an approach that utilises the first intuition. They use an integer programming (optimization) approach to semantic lexicon construction for product reviews, which uses information about antonymy and synonymy from a thesaurus such as WN, negation and other heuristics such as coordination of sentiment phrases to induce a sentiment lexicon.

The second helping hand is not available in product reviews, but is available in scientific writing: discourse analysis a la Argumentative Zoning. Argumentative Zoning is a method of discourse analysis that uses supervised or unsupervised ML to detect stages of argumentation in scientific text. Analysis takes the form of sentence-based classification. Features include lexical indicators, sequence information, location, citation and verb-syntactic features. Generally bad or good expressions can also be derived from looking at patterns such as "too X", "not enough X" or expressions of "right amount of".

The student will reimplement Lu et al.'s approach and use AZ status as an additional information source. Time permitting, the lexicon induced this way can be further improved by bootstrappling. In bootstrapping, preliminary AZ analysis is performed to induce the first sentiment lexicon, which in turn leads to a better AZ classification and so on. This will lead to an improvement of AZ analysis within one domain, and possibly support the porting of AZ to other domains.

An existing AZ system, which includes a simple sentiment lexicon, provides the starting point. The AZ system also guarantees a simple way of performing extrinsic evaluation. Intrinsic evaluation of the quality of the sentiment lexicon is therefore not necessarily required in this project, although validation of the sentiment lexicon by a human experiment would constitute an even better evaluation of the project outcome.

References

Lu, Castellanos, Dayal, Zhai (2011): Automatic construction of a context-aware sentiment lexicon: an optimization approach. WWW '11 Proceedings of the 20th international conference on World wide web Pages 347-356
Teufel and Moens (2002): Summarizing scientific articles: experiments with relevance and rhetorical status. Computational linguistics.
Agichtein and Gravano (2000): Snowball: extracting relations from large plain-text collections. DL '00 Proceedings of the fifth ACM conference on Digital libraries Pages 85-94

Project 2: Domain Adaptation for Argumentative Zoning

Project Originator

Diarmuid O'Seaghdha

Supervisor

Diarmuid O'Seaghdah and Simone Teufel

Project Description

In scientific writing, each part of a text has a specific role to play in building the narrative the writer is trying to communicate. For example, different parts may introduce the topic and motivate its importance, describe previous work and its shortcomings, describe the authors' own contribution or present a conclusion. Argumentative zoning (AZ; Teufel, 2010) is the task of detecting the parts of scientific articles that perform specific rhetorical functions. It has been shown that AZ annotation can improve scientific summarisation and can speed up literature browsing by domain experts.

Automatic AZ annotation is generally treated as a supervised learning task: a statistical classifier is trained on a set of manually annotated texts and learns to predict zones for unseen texts. The features used by the model are a combination of lexical items and information about the text structure. When the unseen data is similar to the training data, AZ classifiers can perform relatively well. However, linguistic and structural conventions can vary greatly even within subfields of the same scientific discipline (e.g., NLP vs theoretical CS). A system trained on annotated NLP papers will have trouble labelling a paper from Nature Genetics. This is a particular instance of a general issue that arises in many areas of NLP (and across machine learning), from parsing to sentiment analysis. "Domain adaptation" is the name given to the task of adapting a statistical classifier to data which is different from the data it was trained on.

Many approaches to domain adaptation have been proposed in the literature, and the goal of this project will be to investigate whether some of these approaches can successfully be applied to argumentative zoning. One simple approach involves augmenting the feature space with in-domain and cross-domain copies of each feature (Daumé III, 2007); another approach uses lexical representations learned from large corpora using clustering (Koo et al, 2008), topic models (Guo et al, 2009; Eidelman et al, 2012) or neural networks (Glorot et al, 2011); yet another generates training data from system predictions through iterative self-training (McClosky and Charniak, 2008). The experimental setup will use pre-existing AZ datasets from two scientific genres (computer science and chemistry) for training and evaluation.

References:

References

Hal Daumé III. 2007. Frustratingly easy domain adaptation. In Proceedings of ACL-07. Prague, Czech Republic.
Vladimir Eidelman, Jordan Boyd-Graber and Philip Resnik. 2012. Topic models for dynamic translation model adaptation. In Proceedings of ACL-12. Jeju, Korea.
Xavier Glorot, Antoine Bordes and Yoshua Bengio. 2011. Domain adaptation for large-scale sentiment classification: A deep learning approach. In Proceedings of ICML-11. Bellevue, WA.
Terry Koo, Xavier Carreras and Michael Collins. 2008. Simple Semi-supervised Dependency Parsing. In Proceedings of ACL-08. Columbus, OH.
Honglei Guo, Huijia Zhu, Zhili Guo, Xiaoxun Zhang, Xian Wu and Zhong Su. 2009. Domain adaptation with latent semantic association for named entity recognition. In Proceedings of ACL-09. Suntec, Singapore.
David McClosky and Eugene Charniak. 2008. Self-training for biomedical parsing. In Proceedings of ACL-08. Columbus, OH.
Simone Teufel. 2010. The Structure of Scientific Articles: Applications to Citation Indexing and Summarization. CSLI Publications, Stanford, CA.

Project 3: Identifying Deixis to Communicative Artifacts in Text

Project Originator

Simone Teufel and Shomir Wilson

Project Supervisor

Simone Teufel and Shomir Wilson

Project Description

This project looks at an aspect of discourse (anaphora and coreference), which currently often causes problems for the practical processing of scientific texts, namely deixis.

Texts of scholarly papers frequently mention items such as illustrations, typesetting elements (e.g., sections and lists), and discourse entities. Some mentions contain identifiers for their referents (such as "in Figure 1" or "in Section 3") but many other mentions connect with their referents less explicitly, using deictic phrases like "the figure above", "this section", or "those ideas". Deictic mentions make a connection between information represented in non-linguistic forms and the meaning of text, but they rely on the reader to select the proper referent for the connection.

Preliminary work has shown the richness and diversity of deictic mentions in textbooks (Wilson and Oberlander 2014). We hypothesize that scholarly papers are similarly fertile, and that deictic mentions have a substantial role in the rhetorical structure of a paper. Separately, analysis of the rhetorical structure of scientific research articles has shown how certain "landmark sentences" (often containing deictic mentions) serve as signposts for the rhetorical functions of passages. However, the role of deictic mentions in rhetorical structure has not been explored. The student will start with an existing corpus of scientific documents annotated with rhetorical markers. The student will use a dependency parser to search the corpus for simple patterns in sentence structure that indicate occurrences of deictic mentions. They will use the results of this search to answer one or more of the following questions:

What is the relationship between sentences that signpost rhetorical structure and deictic mentions? Do the two correlate closely, or are they separate phenomena that happen to overlap?
How often are the referents of deictic mentions subject to vagueness? In particular, how often is it impossible to precisely delimit the part of a document that a deictic mention refers to? Is this vagueness useful, i.e., does it help the writer communicate something that precision would not?
Which methods can we use to automatically disambiguate deicic mentions from signposting mentions? There is a range of supervised and unsupervised approaches available, e.g. Kim and Webber (2006)
When deictic mentions occur as part of rhetorical signposting, how difficult is it to automatically determine the type of the referent (whether it is a sentence, a section of the paper, or something else)? Are simple vocabulary-based heuristics sufficient, or is some deeper understanding of semantics or pragmatics necessary? This task is a typical coreference task (eg. CoNLL 2012), and also touches on aspects of information structure (Roesiger and Teufel, 2014).

References

Wilson and Oberlander (2014). Determiner-established deixis to communicative artifacts in pedagogical text. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics, pp409--414. June 22-27, Baltimore, MD.
Kim and Webber (2006). Automatic Reference Resolution in Astronomy Articles. Proceedings of 20th CODATA conference
I. Roesiger and S. Teufel. 2014. Resolving Coreferent and Associative Noun Phrases in Scientific Text. In: Proceedings of EACL-14, Gothenburg, Sweden.

Project 4: Towards an Automatic Model of Argumentation

Scientific argumentation can be seen as a sequence of speech acts operating in an argumentation game, where the highest-level goal is the justification of the current paper. Intermediate goals are the sub-argument that the work presented is novel, or that the work presented constitutes an improvement over existing work. The recognition of the speech acts in a robust manner will lead to the higher-level intentions and eventually the overall goal. (This is what an argumentation graph looks like.)

The larger framework in which this project operates is that of a spreading activation network, the model of argumentation, which encodes the model's certainty that a speech act was performed, and the connections between goals or argumentation steps.

The particular goal of this project is the detection of mentions of other work -- citations, author names, names of approaches associated with the authors, possessive and personal pronouns. A mention is a noun phrase in any given sentence. For each noun phrase, we need the following two factors:

"grounding" in terms of a link to one or more of the citations listed at the end of the paper
a probability that expresses the certainty that the noun phrase encountered in a given sentence actually does refer to that citation.

Evaluation will compare existing sentences pre-annotated (known to be containing moves involving existing works) with the system's first choice for each noun phrase. 200,000 sentences pre-annotated at the sentence level exit, but the student may have to do some annotation him/herself. The student choosing this project will have to make their own choices wrt machine learning algorithm and creatively develop an algorithm for a new task definition which is located between named entity recognition, citation classification and coreference resolution.

References

Siddharthan and Teufel (2007). Whose idea was this and why does it matter: Attributing scientific work to citations'. in Proceedings of the Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL-HLT 2007). ACL, Morristown, NJ, USA, pp. 316-323, Human Language Technologies: The Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL-HLT 2007), Rochester, New York, United States, 22-27 April.

Project 5: Automatic Identification of Creativity and Innovativeness in Scientific Writing

Project Description

This project proposes the development of a computational-linguistically based indicator of innovativeness, in order to improve bibliometric assessment of science. Bibliometrics is the science of assessing the quality of the research output of researchers or universities on the basis of their research output, for instance the UK's Research Excellence Framework [1]. A related task is IARPA's FUSE program [1], which seeks to detect emerging opportunities in science and technology as early as possible. Its fundamental hypothesis is that real-world processes of technical emergence leaves discernible traces in the public scientific, technical and patent literature. Most of the current science indicators are citation-based. The degree of innovativeness of a paper is an aspect of emergence that is closely related to this idea.

It is commonly believed that high impact papers are innovative. However, some highly cited papers are conforming, document incremental research and tend to reinforce the status quo [3]. Innovativeness cannot therefore be assessed by purely looking at citation count. One can try to approach the problem of identifying innovative scientific papers using citation networks to [3, 4]. This approach is based on the idea that these papers disrupt the existing citation structure of the topic. It has also been long assumed that access to full text would result in better innovation finding. This is examplified by a related problem of identifying "paradigm shifts" [5]. The current project follows along this research avenue, and attempts to add information about sentences such as the following to the search for innovativeness:

This result challenges the claims of recent discourse theories (Grosz and Sidner 1986; Reichman 1985) which argue for a the close relation between cue words and discourse structure.

Our US collaborators Richard Klavans and Kevin Boyack have completed a survey where highly influential biomedical scientists rated 10 of their high-cited papers. Despite the deeply subjective nature of innovativeness, the authors themselves are certainly best in a position to assess how innovative their own papers are, if the self-elicitation is performed in an honest, trusted manner, where the reputation of the informant is not threatened. Klavans and Boyak achieved this by asking only about those papers which are of high-impact anyway. The data from this survey now puts us in the position that we can tag these 1200 papers as being innovative, progressive, or mediocre. The outcome of this project will be a system that is able to automatically assess the level of innovation of an unseen paper.

The core of this project, after having made the AZ implementation run on the biomedical corpus, is to find a correlation between the rhetorical "footprint" of a paper (derived via AZ) and its level of innovation. The "rhetorical footprint" will be based on AZ-based features, which are fed into a machine learning system that correlates these features to the papers' innovativeness status.

Practicalities

Most of the corpus is already acquired in full text and has been transformed into a uniform XML format (SciXML). The first step of this project is to unify the rest of the corpus into SciXML. The existing implementation of Argumentative Zoning ([6,7] cf. other project descriptions) can then be run on the new medical corpus.

AZ currently relies on supervised machine learning. It has been trained on annotated articles from computational linguistics and chemistry [7]. However, the mentioned corpus we will use for learning innovation contains articles from biomedical science. An early stage of this project will therefore assess whether the classification of the AZ system trained on chemistry and CL is adequate for helping in the innovativeness classification, or whether its lexical resources need to be manually adapted.

References

[1] Research excellence framework (REF). http://www.ref.ac.uk/background/ bibliometrics/.
[2] D. A. Murdick, Foresight and understanding from scientific exposition (FUSE). http://www.iarpa.gov/Programs/ia/FUSE/fuse.html.
[3] R. Klavans, K. W. Boyack, A. A. Sorensen, and C. Chen, Towards the development of an indicator of conformity,
[4] C. Chen, Y. Chen, M. Horowitz, H. Hou, Z. Liu, and D. Pellegrino, Towards an explanatory and computational theory of scientific discovery, Journal of Informetrics, vol. 3, no. 3, pp. 191-209, 2009.
[5] F. Lisacek, C. Chichester, A. Kaplan, and A. Sandor, Discovering paradigm shift patterns in biomedical abstracts: application to neurodegenerative diseases, In: first international symposium on semantic mining in biomedicine, pp. 11-13, Citeseer, 2005.
[6] S. Teufel and M. Moens, Summarizing scientific articles: experiments with relevance and rhetorical status, Computational linguistics, vol. 28, no. 4, pp. 409--445, 2002.
[7] S. Teufel, A. Siddharthan, and C. Batchelor, Towards discipline-independent argumentative zoning: Evidence from chemistry and computational linguistics, in: Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing: Volume 3, pp. 1493--1502, Association for Computational Linguistics, 2009

Last modified: Tue Oct 28 17:09:03 GMT 2014