Machine translation for semantic parsing

Proposer: Andreas Vlachos, Stephen Clark
Supervisor: Andreas Vlachos, Stephen Clark
Special Resources: None

Task Description

Semantic parsing is the task of mapping natural language (NL) utterances to machine-interpretable meaning representations (MRs), so that an appropriate response can be returned. In the example below, in the context of a geographical information system (GIS), the NL question is converted into an appropriate MR so that the answer can be obtained from a database:

NL: How many cities are there in the US?
MR: answer(count(city(loc_2(countryid(usa)))))

Recent work has developed algorithms for a variety of MR languages in a variety of contexts (GIS, flight and restaurant bookings, etc.), as well as natural languages other than English [1,2,3].

Semantic parsing can be seen as a form of machine translation (MT), from English (or some other NL) to the MR language in question [4]. In developing such an approach, we can take advantage of recent advances in statistical methods for MT. Furthermore, unlike most recent work, including [4], such an approach is likely to generalize well to more than one MR language or domain. Potential difficulties might arise from the relatively small training datasets used in semantic parsing, as well as from the differences between machine-interpretable MRs and human languages. Identifying and exploiting the strengths of MT methods while addressing their weaknesses in the context of semantic parsing is the focus of this project. The desired output is a semantic parser that generalizes across MR languages, natural languages and domains.
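To make the MT view concrete, the MR has to be linearized into a token sequence that an MT system can align to the NL words. The sketch below does this for the GeoQuery-style example above; the tokenization scheme is one possible choice for illustration, not necessarily the one used in [4]:

```python
import re

def linearize_mr(mr: str) -> list[str]:
    """Split a functional MR into a flat token sequence,
    keeping the parentheses as separate tokens."""
    return [tok for tok in re.split(r"([()])", mr) if tok]

mr = "answer(count(city(loc_2(countryid(usa)))))"
print(linearize_mr(mr))
# -> ['answer', '(', 'count', '(', 'city', '(', 'loc_2', '(',
#     'countryid', '(', 'usa', ')', ')', ')', ')', ')']
```

One design choice here is whether to keep the brackets as tokens: dropping them makes the sequences shorter, but then the tree structure has to be reconstructed when converting the MT output back into a well-formed MR.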

Plan of action

We have access to datasets in all the domains mentioned earlier, one of which is available in eight natural languages. Furthermore, we are in the process of annotating a new semantic parsing dataset in the context of the SpaceBook project, which the student will be encouraged to work with. In terms of machine translation toolkits, we suggest the implementation of IBM Model 4 in GIZA++ as a starting point, since it is the basis of more advanced models [5]. The goal is to develop a system that performs well on a variety of datasets. Depending on the findings, more advanced toolkits will be explored, such as Moses, Joshua, cdec, etc.
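As background for the GIZA++ starting point, the sketch below implements IBM Model 1, the simplest of the IBM models and the first stage of the training pipeline that leads to Model 4. It learns word-translation probabilities by EM on a toy NL-MR parallel corpus; the sentence pairs are invented for illustration:

```python
# IBM Model 1 on a toy corpus of (NL words, MR tokens) pairs.
from collections import defaultdict

corpus = [
    ("how many cities".split(), "count city".split()),
    ("how many states".split(), "count state".split()),
    ("cities in texas".split(), "city loc texas".split()),
]

# t[(f, e)]: probability of MR token f given NL word e,
# initialized uniformly (any positive constant works for Model 1 EM).
nl_vocab = {w for nl, _ in corpus for w in nl}
t = defaultdict(lambda: 1.0 / len(nl_vocab))

for _ in range(20):  # EM iterations
    count = defaultdict(float)
    total = defaultdict(float)
    for nl, mr in corpus:
        for f in mr:
            z = sum(t[(f, e)] for e in nl)  # normalization for this f
            for e in nl:
                c = t[(f, e)] / z  # expected alignment count
                count[(f, e)] += c
                total[e] += c
    for (f, e), c in count.items():  # M-step: renormalize per e
        t[(f, e)] = c / total[e]

# After EM, "cities" aligns to "city" more strongly than to its
# competitors "count", "loc" or "texas".
print(t[("city", "cities")])
```

Model 1 ignores word order entirely; Models 2-4 add distortion and fertility on top of exactly these translation tables, which is why GIZA++ trains them in sequence.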

Remarks

The student is likely to take advantage of existing open-source MT toolkits, most of which are written in C++ or Java. Therefore, familiarity with these languages is needed. Semantic parsing is a research area attracting substantial interest in the community, therefore a well-researched approach is likely to result in a publication.

References


[1] Semantic Parsing with Bayesian Tree Transducers. Bevan K. Jones, Mark Johnson, Sharon Goldwater. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics, 2012.
[2] Spoken Language Understanding from Unaligned Data using Discriminative Classification Models. Francois Mairesse, Milica Gasic, Filip Jurcicek, Simon Keizer, Blaise Thomson, Kai Yu and Steve Young. In Proceedings of ICASSP, Taipei, 2009.
[3] Lexical Generalization in CCG Grammar Induction for Semantic Parsing. Tom Kwiatkowski, Luke Zettlemoyer, Sharon Goldwater and Mark Steedman. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), Edinburgh, UK, 2011.
[4] Learning for Semantic Parsing with Statistical Machine Translation. Yuk Wah Wong, Raymond J. Mooney. In Proceedings of the Human Language Technology Conference of the NAACL, Main Conference (HLT-NAACL 2006), pp. 439-446.
[5] Statistical Machine Translation. Adam Lopez. In ACM Computing Surveys 40(3): Article 8, pages 1-49, August 2008.

Reducing feature sparsity using language models

Proposer: Andreas Vlachos, Stephen Clark
Supervisor: Andreas Vlachos, Stephen Clark
Special Resources: None

Task Description

In most natural language processing (NLP) tasks, instances are represented using lexicalized features. While such features are informative, they result in very sparse feature sets, especially when combined with other kinds of information, e.g. syntactic relations. Since most words are rare, many features encountered during testing are absent from the labeled data used to train the model, which hurts predictive performance. For example, if the task is to extract the year a building was completed, in the following sentence:

Archers' Hall was finished in 1777.

a useful feature would be the syntactic dependency path from "Archers' Hall" to "1777" via "finished". If we have not encountered the verb "finish" in the same syntactic context in our training data, a model is unlikely to predict this instance correctly. This is a well-recognized problem in the literature and a variety of approaches have been proposed ([1],[2],[3]). However, most of these apply to tasks in which words alone are informative enough features to obtain good performance, and features combining words and syntactic information are not commonly used. The idea in this project is to substitute words unseen in our training data with ones that we have encountered, so as to improve predictive performance. In the example above, we would like to substitute "finished" with "completed", assuming that we encountered the latter in the same syntactic context during training. A standard way of approaching the word substitution problem is to use a language model, i.e. a model that, given a certain context, can predict what the next word should be. Language models are well studied and a large variety of models and implementations exist in the literature. The student will explore the various models available, as well as the possible ways they can be used for feature sparsity reduction in the context of information extraction tasks.
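As a concrete toy illustration of the intended substitution, the sketch below backs off from an unseen lexicalized dependency-path feature to one observed in training. The path notation and the substitution table are invented for this example; in the project, the candidate substitutes would be ranked by a language model rather than hand-written:

```python
# Features (with labels) observed in the training data.
training_features = {
    "ENTITY<-nsubjpass<-completed->prep_in->YEAR",
    "ENTITY<-nsubjpass<-built->prep_in->YEAR",
}

# Candidate substitutes for the context "X was _ in YEAR"; a
# hand-written stand-in for real language-model predictions.
lm_substitutes = {"finished": ["completed", "built", "opened"]}

def desparsify(feature: str, head: str) -> str:
    """If `feature` is unseen, try substituting `head` with an
    LM-ranked alternative that yields a feature seen in training."""
    if feature in training_features:
        return feature
    for sub in lm_substitutes.get(head, []):
        candidate = feature.replace(head, sub)
        if candidate in training_features:
            return candidate
    return feature  # no usable substitute found

f = "ENTITY<-nsubjpass<-finished->prep_in->YEAR"
print(desparsify(f, "finished"))
# -> ENTITY<-nsubjpass<-completed->prep_in->YEAR
```

A classifier would then fire on the substituted feature, whose weight was learned from the training data, instead of on an unseen feature with no weight at all.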

Plan of action

We have datasets and state-of-the-art systems for biomedical event extraction (from the BioNLP 2011 shared task [4]) and database population, tasks for which we know that feature sparsity limits performance. As for language modelling, a good starting point would be the recent work by Mnih and Teh [5], which has been applied successfully to the closely related task of sentence completion. Of course, the student is welcome to experiment with other tasks, systems and datasets that are of interest.

Remarks

The student is likely to take advantage of existing language modelling toolkits. In terms of programming language, the existing systems for the tasks are written in Python, therefore knowledge of this language (or willingness to learn it) is needed to take advantage of them. A well-researched and successful approach is likely to result in a publication.

References


[1] Turian, J., Ratinov, L., and Bengio, Y. (2010). Word representations: a simple and general method for semi-supervised learning. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, pages 384-394.
[2] Maas, A. L., Daly, R. E., Pham, P. T., Huang, D., Ng, A. Y., and Potts, C. (2011). Learning word vectors for sentiment analysis. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics, pages 142-150, Portland, Oregon.
[3] Socher, R., Pennington, J., Huang, E. H., Ng, A. Y., and Manning, C. D. (2011). Semi-Supervised Recursive Autoencoders for Predicting Sentiment Distributions. In Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing.
[4] Jin-Dong Kim, Yue Wang, Toshihisa Takagi and Akinori Yonezawa. Overview of Genia Event Task in BioNLP Shared Task 2011.
[5] A fast and simple algorithm for training neural probabilistic language models, Andriy Mnih and Yee Whye Teh , International Conference on Machine Learning 2012.

Negation and speculation recognition for biomedical event extraction

Proposer: Andreas Vlachos
Supervisor: Andreas Vlachos, Stephen Clark
Special Resources: None

Task Description

Biomedical event extraction is the task of extracting specific types of information about proteins. For example, from the following passage:

"TRADD was the only protein that interacted with wild-type TES2 and not with isoleucine-mutated TES2."

the following events should be extracted:
E1 Binding(Theme:"TRADD", Theme:"wild-type TES2")
E2 Binding(Theme:"TRADD", Theme:"isoleucine-mutated TES2")

Similarly, from this passage:

"In this study we hypothesized that the phosphorylation of TRAF2 inhibits binding to the CD40."

the following events should be extracted:
E1 Phosphorylation(Theme:"TRAF2")
E2 Binding(Theme:"TRAF2", Theme:"CD40")
E3 Negative_regulation(Theme:E2, Cause:E1)

(More information on event extraction can be found at the BioNLP 2011 shared task [1] website).

Note that in the first passage above event E2 is negated, and in the second one event E3 is speculated upon. While this information is of importance to the users of event extraction systems, most state-of-the-art systems are unable to provide it. The task itself is rarely attempted (only two participants in the BioNLP 2011 shared task) as it is quite challenging: the information that needs to be extracted is fine-grained at the level of events, and the annotated data we are provided with do not contain annotation at the lexical level, i.e. we do not know which words cause an event to be characterized as speculative or negated (such words are sometimes referred to as negation and speculation cues).
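One possible in-memory representation of such annotations is sketched below. The class and field names are our own choices for illustration, not the shared-task file format; they mirror the nested structure of the examples above, with flags for negation and speculation:

```python
from dataclasses import dataclass, field

@dataclass
class Event:
    event_type: str                             # e.g. "Binding"
    themes: list = field(default_factory=list)  # entity strings or Events
    cause: object = None                        # entity string or Event
    negated: bool = False
    speculated: bool = False

# Second passage: "we hypothesized that the phosphorylation of TRAF2
# inhibits binding to the CD40."
e1 = Event("Phosphorylation", themes=["TRAF2"])
e2 = Event("Binding", themes=["TRAF2", "CD40"])
e3 = Event("Negative_regulation", themes=[e2], cause=e1, speculated=True)
```

Because events nest (E3 takes E2 as a theme and E1 as a cause), the negation and speculation flags must be assigned per event rather than per sentence, which is what makes the task fine-grained.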

Plan of action

The aim of this project is to build a component for the state-of-the-art event extraction system of Vlachos and Craven [2] that is able to extract this kind of information. Initially, we will explore the negation and speculation phenomena that are relevant to event extraction. This will give us a better understanding of the task and help us build an initial rule-based approach, similar to that of Kilicoglu and Bergler [3].
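A minimal version of such a rule-based component might simply look for cue words near an event, as sketched below. The cue lists are small illustrative samples, and a real implementation would follow Kilicoglu and Bergler in relating cues to event triggers via dependency paths rather than a whole-sentence window:

```python
# Toy cue lexicons; real ones would be curated from the literature.
NEGATION_CUES = {"not", "no", "never", "failed", "absence"}
SPECULATION_CUES = {"hypothesized", "suggest", "may", "might", "whether"}

def flag_event(sentence: str) -> dict:
    """Flag an event as negated/speculated if a cue word occurs
    anywhere in the sentence containing it (a crude approximation)."""
    tokens = {t.strip(".,").lower() for t in sentence.split()}
    return {
        "negated": bool(tokens & NEGATION_CUES),
        "speculated": bool(tokens & SPECULATION_CUES),
    }

print(flag_event("TRADD interacted with wild-type TES2 and not with "
                 "isoleucine-mutated TES2."))
# -> {'negated': True, 'speculated': False}
```

The error analysis of such a baseline should reveal exactly where sentence-level matching fails, e.g. the first example passage contains one negated and one non-negated event in the same sentence, which this sketch cannot distinguish.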

Following analysis of the errors made, we will try to address them using a machine learning-based method. A baseline approach would be to represent event context using appropriate features and learn a classifier. However, such an approach might not work well due to sparsity issues. A more interesting alternative is to view the task as structured prediction, in which we first detect negation and speculation cues and then identify the events characterized by them. As discussed above, cue-level annotation is unavailable. Therefore we will experiment with the search-based structured prediction framework [4], which can handle such issues and which we used successfully to build the event extraction system of Vlachos and Craven [2].

Remarks

The code for the current event extraction system is in Python. As the project is likely to involve substantial interaction with it, the student should be willing to work with this language. Negation and speculation detection at the level of events has rarely been attempted in the past due to its challenging nature, thus a reasonably well-performing approach is likely to result in a publication.

References


[1] Jin-Dong Kim, Yue Wang, Toshihisa Takagi and Akinori Yonezawa. Overview of Genia Event Task in BioNLP Shared Task 2011.
[2] Vlachos, A., Craven, M. 2011. Search-based Structured Prediction applied to Biomedical Event Extraction, in Proceedings of CoNLL at ACL, Portland.
[3] Kilicoglu, Halil and Bergler, Sabine 2009. Syntactic Dependency Based Heuristics for Biological Event Extraction, In: Proceedings of the BioNLP 2009 Workshop Companion Volume for Shared Task, NAACL, Boulder, Colorado, pp 119--127.
[4] Search-based Structured Prediction. Hal Daumé III, John Langford and Daniel Marcu. Machine Learning Journal, 2009.