The meaning of many adjectives is highly context-dependent. For example, 'cool' can mean different things depending on the context, e.g.
The project would suit a student who is interested in machine learning and its application to NLP tasks. Programming language of choice.
Lin Sun and Anna Korhonen. 2009. Improving Verb Clustering with Automatically Acquired Selectional Preferences. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP). Singapore.
Blei, D.M., Ng, A.Y. and Jordan, M.I. 2003. Latent Dirichlet allocation. The Journal of Machine Learning Research, 3, 993-1022.
Brody, S. and Lapata, M. 2009. Bayesian word sense induction. Proceedings of the 12th Conference of the European Chapter of the Association for Computational Linguistics. 103-111.
Agirre, E., Martinez, D., de Lacalle, O.L. and Soroa, A. 2006. Two graph-based algorithms for state-of-the-art WSD. Proceedings of the 2006 Conference on Empirical Methods in Natural Language Processing. 585-593.
Sauper, C., Haghighi, A. and Barzilay, R. 2010. Incorporating Content Structure into Text Analysis Applications. Proceedings of EMNLP.
Can Twitter data be thought of as a new genre of English? The character limit and the real-time, connected nature of the medium mean that people write quite differently on Twitter than in other genres, possibly closer to the way they talk than to the way they normally write. One feature we might expect to find is that sentence subjects become optional on Twitter, unlike in standard English. This project will involve studying the use of verbs on Twitter and addressing the challenges it poses for NLP. The project will involve
Programming language of choice.
Tim Van de Cruys, Laura Rimell, Thierry Poibeau and Anna Korhonen. 2011. Multi-way Tensor Factorization for the Unsupervised Induction of Subcategorization Frames.
Cedric Messiant. 2008. A Subcategorization Acquisition System for French Verbs. Proceedings of ACL.
Improved understanding of second language acquisition (SLA) is critical for developing more useful applications for language learning and teaching. In attempting to understand SLA, large data sources containing naturalistic learner language (i.e. learner corpora) are essential. The new EF-Cambridge Open English Learner Database (EFCamDat), developed at the University of Cambridge, provides millions of scripts representing a wide range of topic areas, written by over 700,000 students attending the online school of Education First (EF) worldwide. Its size is predicted to grow to 100 million words by 2014, making it by far the largest learner corpus available.
For any practical application, this corpus must be processed automatically using Natural Language Processing (NLP) technology. This project will investigate the challenges involved in applying NLP techniques to imperfect learner data, and ways of overcoming them.
NLP tools have been highly successful in the automatic tagging and parsing of native English. However, the suitability of such tools for language produced by learners of English or by fluent non-native speakers has not been sufficiently explored. We have started looking into how part-of-speech taggers and parsers trained on native English can be used to obtain reliable linguistic information on word types and sentence structure. Our preliminary investigations have revealed interesting patterns. For instance, a statistical parser may be more robust to learner errors than a rule-based parser that uses a highly specific grammar of English. At the same time, the latter parser may be excellent at identifying learner errors.
For instance, in the sentence "I never do the laundry and mop de floor", both "mop" and "de" are usually tagged as foreign nouns. Interestingly, though, some parsers ignore such mistakes, as in "I finally ger an offer...", where the erroneous "ger" is correctly assigned a verbal tag based on grammatical expectations.
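The contrast described above can be illustrated with a deliberately tiny sketch. It compares a tagger that relies on a lexicon alone with one that also uses the previous tag to guess the tag of an unknown word. The lexicon, the tag-transition table and both function names are invented for this illustration; they are assumptions, not the taggers or parsers used in the preliminary investigations.

```python
# Toy illustration only: the lexicon and the tag expectations below are
# invented for this example and do not come from any real tagger.
LEXICON = {"i": "PRP", "finally": "RB", "an": "DT", "offer": "NN"}

# Hypothetical "grammatical expectations": the tag we expect after a given tag.
EXPECTED_NEXT = {"<s>": "PRP", "PRP": "VBP", "RB": "VBP", "DT": "NN", "VBP": "DT"}


def tag_lexicon_only(tokens):
    """Tag known words from the lexicon; mark every unknown word as foreign (FW)."""
    return [(t, LEXICON.get(t.lower(), "FW")) for t in tokens]


def tag_with_context(tokens):
    """Tag known words from the lexicon; guess unknown words from the previous tag."""
    tagged, prev = [], "<s>"
    for t in tokens:
        tag = LEXICON.get(t.lower()) or EXPECTED_NEXT.get(prev, "NN")
        tagged.append((t, tag))
        prev = tag
    return tagged


sentence = "I finally ger an offer".split()
print(tag_lexicon_only(sentence))   # "ger" is unknown, so it is tagged FW
print(tag_with_context(sentence))   # "ger" follows an adverb, so it is tagged VBP
```

A real statistical tagger estimates such expectations from annotated data rather than from a hand-written table, but the effect on a misspelled verb is similar in spirit.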
This project will investigate the following issues:
Familiarity with machine learning toolkits such as Weka and Mallet is useful (but not necessary) for this project.
 EF-Cambridge Open English Learner Database: http://www.mml.cam.ac.uk/dtal/research/EF/corpus.html
 EF Englishtown: https://www.englishtown.com/
 Julia Krivanek and Detmar Meurers. 2011. Comparing Rule-Based and Data-Driven Dependency Parsing of Learner Language. In Proceedings of the International Conference on Dependency Linguistics.
 Theodora Alexopoulou, Jeroen Geertzen, Anna Korhonen and Detmar Meurers. 2012. L1 effects in L2 English relative clauses: Evidence from corpus production. EuroSLA22 - 22nd Annual Conference of the European Second Language Association.
Text mining has the potential to yield significant benefits in biomedical research. Among the most appealing are the ability to unlock hidden information in large collections of scientific literature, to develop new knowledge, and to improve the efficiency and quality of the research process. However, a number of research challenges need to be addressed before these benefits can be fully realised. One of the most important is to improve the portability of text mining. Because biomedicine shows significant sub-domain variation (Lippincott et al., 2011), techniques optimised for specific areas of biomedicine (e.g. molecular biology) do not necessarily port well to others (e.g. public health, neurology, physiology). Most current techniques rely on manually annotated, domain-specific datasets, which are costly to obtain. Researchers have therefore begun to investigate domain adaptation, which uses labeled data from one or several source domains to learn a hypothesis that performs well on a different domain for which little or no labeled data is available (Blitzer and Daumé, 2010). Although some prior work exists on domain adaptation for biomedical text processing and mining (e.g. Miwa et al., 2012), there are many unexplored areas. This project will investigate domain adaptation in one of these areas, for example:
This project would suit a student who is interested in machine learning and in applying NLP to biomedicine. Familiarity with machine learning toolkits such as Weka and Mallet would be useful (but is not necessary). Programming language of choice.
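As a rough illustration of what domain adaptation can look like in practice, the sketch below implements feature augmentation, a simple technique usually associated with Daumé III's "frustratingly easy" approach. Each feature is copied into a shared version and a domain-specific version, so that a standard classifier can learn which cues transfer across sub-domains and which do not. The project description does not say which method would be used; the function name, the domain labels and the example feature are all assumptions made for this sketch.

```python
def augment(features, domain):
    """Return a shared copy and a domain-specific copy of every feature.

    `features` maps feature names to values; `domain` is a label such as
    "molbio" (a hypothetical name chosen for this example).
    """
    augmented = {}
    for name, value in features.items():
        augmented["shared:" + name] = value      # visible to all domains
        augmented[domain + ":" + name] = value   # visible to this domain only
    return augmented


# Hypothetical feature vectors from a source and a target sub-domain.
source = augment({"word=protein": 1.0}, "molbio")
target = augment({"word=protein": 1.0}, "publichealth")

print(sorted(source))
print(sorted(target))
# Only the shared copy is common to both domains.
print(set(source) & set(target))
```

The augmented dictionaries can be fed to any learner that accepts sparse features. This variant assumes at least a small amount of labeled target-domain data.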
Tom Lippincott, Diarmuid O Seaghdha and Anna Korhonen. 2011. Exploring subdomain variation in biomedical language. BMC Bioinformatics 12:212.
John Blitzer and Hal Daumé III. 2010. ICML 2010 Tutorial on Domain Adaptation: http://adaptationtutorial.blitzer.com/
Miwa, M., Thompson, P. and Ananiadou, S. (2012). Boosting automatic event extraction from the literature using domain adaptation and coreference resolution. Bioinformatics 2012.
Yufan Guo, Anna Korhonen, Maria Liakata, Ilona Silins, Johan Hogberg and Ulla Stenius. 2011. A comparison and user-based evaluation of models of textual information structure in the context of cancer risk assessment. BMC Bioinformatics 12:69.
Anna Korhonen, Diarmuid O Seaghdha, Ilona Silins, Lin Sun, Johan Hogberg and Ulla Stenius. 2012. Text mining for literature review and knowledge discovery in cancer risk assessment and research. PLoS ONE 7(4):e33427.