Negation and speculation recognition for biomedical event extraction

Proposer: Andreas Vlachos
Supervisors: Andreas Vlachos, Stephen Clark
Special Resources: None

Task Description

Biomedical event extraction is the task of extracting specific types of information about proteins. For example, from the following passage:

"TRADD was the only protein that interacted with wild-type TES2 and not with isoleucine-mutated TES2."

the following events should be extracted:
E1 Binding(Theme:"TRADD", Theme:"wild-type TES2")
E2 Binding(Theme:"TRADD", Theme:"isoleucine-mutated TES2")

Similarly, from this passage:

"In this study we hypothesized that the phosphorylation of TRAF2 inhibits binding to the CD40."

the following events should be extracted:
E1 Phosphorylation(Theme:"TRAF2")
E2 Binding(Theme:"TRAF2", Theme:"CD40")
E3 Negative_regulation(Theme:E2, Cause:E1)

(More information on event extraction can be found at the BioNLP 2011 shared task [1] website: https://sites.google.com/site/bionlpst/)

Note that in the first passage above event E2 is negated, and in the second one event E3 is speculated upon. While this information is important to the users of event extraction systems, most state-of-the-art systems are unable to provide it. The task itself is rarely attempted (only two participants in the BioNLP 2011 shared task) as it is quite challenging: the information that needs to be extracted is fine-grained, at the level of events, and the annotated data we are provided with does not contain annotation at the lexical level, i.e. we do not know which words result in an event being characterized as speculative or negated (such words are sometimes referred to as negation and speculation cues).
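
To make the target output concrete, the two passages above could be represented as follows. This is only a minimal sketch: the Event class, its negated and speculated fields and the modifications helper are hypothetical names chosen for illustration, not the data model of the existing system.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Event:
    """One extracted event (hypothetical representation, for illustration only)."""
    id: str
    type: str
    # (role, filler) pairs; a filler is a protein mention or the id of another event.
    args: List[Tuple[str, str]]
    negated: bool = False
    speculated: bool = False


first_passage = [
    Event("E1", "Binding", [("Theme", "TRADD"), ("Theme", "wild-type TES2")]),
    Event("E2", "Binding", [("Theme", "TRADD"), ("Theme", "isoleucine-mutated TES2")],
          negated=True),
]

second_passage = [
    Event("E1", "Phosphorylation", [("Theme", "TRAF2")]),
    Event("E2", "Binding", [("Theme", "TRAF2"), ("Theme", "CD40")]),
    Event("E3", "Negative_regulation", [("Theme", "E2"), ("Cause", "E1")],
          speculated=True),
]


def modifications(events):
    """List (event id, modification) pairs for every negated or speculated event."""
    found = []
    for event in events:
        if event.negated:
            found.append((event.id, "Negation"))
        if event.speculated:
            found.append((event.id, "Speculation"))
    return found
```

The component proposed here would be responsible for setting the two boolean fields; everything else is already produced by the event extraction system.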

Plan of action

The aim of this project is to build a component for the state-of-the-art event extraction system of Vlachos and Craven [2] that would be able to extract this kind of information. Initially, we will explore the negation and speculation phenomena that are relevant to event extraction as defined in the shared task. This will give us a better understanding of the task and help us build an initial rule-based approach, similar to the one of Kilicoglu and Bergler [3].
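
As a rough idea of what a first rule could look like, the sketch below marks an event as negated when a negation cue occurs shortly before one of its argument mentions. It is a deliberately simplistic assumption and not the method of Kilicoglu and Bergler [3], which relies on syntactic dependencies; the cue list, the window size and all function names are hypothetical.

```python
import re

# Hypothetical cue list for illustration; a real lexicon would come from data analysis.
NEGATION_CUES = {"not", "no", "without", "unable"}


def tokenize(text):
    """Lowercase the text and split it into word tokens, keeping hyphenated words whole."""
    return re.findall(r"[a-z0-9]+(?:-[a-z0-9]+)*", text.lower())


def find_mention(tokens, mention):
    """Return the index of the first token of `mention` in `tokens`, or -1 if absent."""
    words = tokenize(mention)
    for i in range(len(tokens) - len(words) + 1):
        if tokens[i:i + len(words)] == words:
            return i
    return -1


def cue_before(tokens, position, cues, window=3):
    """True if any cue occurs among the `window` tokens immediately before `position`."""
    return any(tok in cues for tok in tokens[max(0, position - window):position])


def is_negated(tokens, mentions, window=3):
    """Rule: an event is negated if a cue closely precedes any of its argument mentions."""
    positions = [find_mention(tokens, m) for m in mentions]
    return any(p >= 0 and cue_before(tokens, p, NEGATION_CUES, window) for p in positions)


passage = ("TRADD was the only protein that interacted with wild-type TES2 "
           "and not with isoleucine-mutated TES2.")
tokens = tokenize(passage)
e1_negated = is_negated(tokens, ["TRADD", "wild-type TES2"])
e2_negated = is_negated(tokens, ["TRADD", "isoleucine-mutated TES2"])
```

On the first passage this rule separates E2 from E1. A fixed window is crude, however: in the second passage "hypothesized" precedes the mentions of all three events, yet only E3 is speculated upon, so deciding which event a cue applies to needs more than proximity.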

Following analysis of the errors made, we will try to address them using a machine learning-based method. A baseline approach would be to represent event context using appropriate features and learn a classifier. However, such an approach might not work well due to sparsity issues. A more interesting way is to think of the task in terms of structured prediction, in which we first detect negation and speculation cues and then identify the event characterized by them. As discussed above, this level of annotation is unavailable. Therefore we will experiment with the search-based structured prediction framework [4], which can handle such issues and which we have used successfully to build the event extraction system of Vlachos and Craven [2].
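
The baseline mentioned above could be sketched as follows, assuming bag-of-words features in a small window around an argument mention and a plain perceptron as the classifier. Both choices, the function names and the window size are assumptions made for illustration; the sketch covers only the baseline and not the search-based structured prediction framework [4].

```python
from collections import defaultdict


def context_features(tokens, anchor, window=3):
    """Bag-of-words features for tokens within `window` positions of `anchor`."""
    feats = {}
    for i in range(max(0, anchor - window), min(len(tokens), anchor + window + 1)):
        if i == anchor:
            continue
        side = "before" if i < anchor else "after"
        feats[f"{side}={tokens[i]}"] = 1.0
    return feats


def score(weights, feats):
    """Dot product between the weight vector and a sparse feature dictionary."""
    return sum(weights.get(name, 0.0) * value for name, value in feats.items())


def train_perceptron(examples, epochs=5):
    """Train on (features, label) pairs, where label is +1 (negated) or -1."""
    weights = defaultdict(float)
    for _ in range(epochs):
        for feats, label in examples:
            if label * score(weights, feats) <= 0:  # mistake: move towards the label
                for name, value in feats.items():
                    weights[name] += label * value
    return weights


def predict(weights, feats):
    """Return +1 if the context is classified as negated, otherwise -1."""
    return 1 if score(weights, feats) > 0 else -1


# The first passage, with the two TES2 mentions used as anchors (token indices 8 and 13).
tokens = ("TRADD was the only protein that interacted with wild-type TES2 "
          "and not with isoleucine-mutated TES2").lower().split()
examples = [
    (context_features(tokens, 8), -1),   # argument of E1, not negated
    (context_features(tokens, 13), 1),   # argument of E2, negated
]
weights = train_perceptron(examples)
```

With only these two examples the classifier separates the training contexts, but a cue it has never seen (for instance a feature such as "before=without") receives zero weight, which is the sparsity problem in miniature.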

Remarks

The code for the current event extraction system is in Python. As the project is likely to have substantial interaction with it, the student should be willing to work with this language. Negation and speculation detection at the level of events has rarely been attempted in the past due to its challenging nature, thus a reasonably well-performing approach is likely to result in a publication.

References


[1] Kim, J.-D., Wang, Y., Takagi, T., Yonezawa, A. 2011. Overview of Genia Event Task in BioNLP Shared Task 2011.
[2] Vlachos, A., Craven, M. 2011. Search-based Structured Prediction applied to Biomedical Event Extraction, in Proceedings of CoNLL at ACL, Portland.
[3] Kilicoglu, H., Bergler, S. 2009. Syntactic Dependency Based Heuristics for Biological Event Extraction, in Proceedings of the BioNLP 2009 Workshop Companion Volume for Shared Task, NAACL, Boulder, Colorado, pp. 119-127.
[4] Daumé III, H., Langford, J., Marcu, D. 2009. Search-based Structured Prediction, Machine Learning Journal.